RDO BC1-7 GPU texture encoders

Overview

bc7enc - Fast BC1-7 GPU texture encoders with Rate Distortion Optimization (RDO)

This repo contains fast texture encoders for BC1-7. All formats support a simple post-processing transform on the encoded texture data, designed to trade off quality for smaller compressed file sizes using LZ compression. Significant (10-50%) size reductions are possible. The BC7 encoder also supports a "reduced entropy" mode (the -e option) that biases/weights the output in ways that minimally impact quality, yielding 5-10% smaller file sizes with no increase in encoding time.

Currently, the entropy reduction transform is tuned for Deflate, LZHAM, or LZMA. The method used to control the rate-distortion tradeoff is the classic Lagrangian multiplier RDO method, modified to favor MSE on very smooth blocks. Rate is approximated using a fixed Deflate model. The post-processing transform applied to the encoded texture data tries to introduce the longest match it can into every encoded output block. It also tries to continue matches between blocks and (specifically for codecs like LZHAM/LZMA/Zstd) it tries to utilize REP0 (repeat) matches.

You can see examples of the RDO BC7 encoder's current output here. Some examples on how to use the command line tool are on my blog, here.

This repo contains both bc7e.ispc and its distantly related but weaker, 4-mode-only, non-ISPC variant, bc7enc.cpp. The -U command line option enables bc7e.ispc; otherwise you get bc7enc.cpp. bc7e supports all BC7 modes and features, but doesn't yet support reduced entropy BC7 encoding. bc7enc.cpp supports optional reduced entropy encoding (using -e with the command line tool). RDO BC7 is supported with either encoder, however.

The next major focus will be improving the default smooth block handling and improving rate distortion performance.

This repo was originally derived from bc7enc.

Compiling

This build has been tested with MSVC 2019 x64 and clang 6.0.0 under Ubuntu v18.04.

To compile with bc7e.ispc (on Linux this requires Intel's ISPC compiler to be in your path - recommended):

cmake -D SUPPORT_BC7E=TRUE .
make

To compile without BC7E:

cmake .
make

Note that both the MSVC and Linux builds enable OpenMP for faster compression.

Examples

To encode to non-RDO BC7 using BC7E, highest quality, using perceptual (scaled YCbCr) colorspace error metrics:

./bc7enc blah.png -U -u6 -s

To encode to non-RDO BC7 using BC7E, highest quality, linear RGB(A) metrics:

./bc7enc blah.png -U -u6

To encode to RDO BC7 using BC7E, highest quality, lambda=.5, allow 2 matches instead of 1 per block for higher effectiveness, linear metrics (perceptual colorspace metrics are always automatically disabled when -z is specified):

./bc7enc blah.png -U -u6 -z.5 -zn

To encode to RDO BC7 using BC7E, high quality, lambda=.5, linear metrics, with significantly faster encoding time (sacrificing compression effectiveness due to -zc16):

./bc7enc blah.png -U -u4 -z.5 -zc16

To encode to non-RDO BC7 using entropy reduced or quantized/weighted BC7 (no slowdown vs. non-RDO bc7enc.cpp for BC7, slightly reduced quality, but 5-10% better LZ compression, only uses 2 or 4 BC7 modes):

./bc7enc blah.png -e

To encode to RDO BC7 using the entropy reduction transform combined with reduced entropy BC7 encoding, with a slightly larger window size than the default of 128 bytes:

./bc7enc -zc256 blah.png -e -z1.0

Same as before, but higher compression (allow 2 matches per block instead of 1):

./bc7enc -zc256 blah.png -e -z1.0 -zn

Same, except disable ultra-smooth block handling:

./bc7enc -zc256 blah.png -e -z1.0 -zu

To encode to RDO BC7 using the entropy reduction transform at lower quality, combined with reduced entropy BC7 encoding, with a slightly larger window size than the default of 128 bytes:

./bc7enc -zc256 blah.png -e -z2.0

To encode to RDO BC7 using the entropy reduction transform at higher effectiveness using a larger window size, without using reduced entropy BC7 encoding:

./bc7enc -zc1024 blah.png -z1.0

To encode to RDO BC7 using the entropy reduction transform at higher effectiveness using a larger window size, with a manually specified max smooth block error scale:

./bc7enc -zc1024 blah.png -z2.0 -zb30.0

To encode to RDO BC7 using the entropy reduction transform at higher effectiveness using a larger window size, using only mode 6 (more block artifacts, but better rate-distortion performance as measured by PSNR):

./bc7enc -zc1024 blah.png -6 -z1.0 -e

To encode to BC1:

./bc7enc -1 blah.png

To encode to BC1 with Rate Distortion Optimization (RDO) at lambda=1.0:

./bc7enc -1 -z1.0 blah.png

The -z option controls lambda, the rate vs. distortion tradeoff: 0 = maximum quality; higher values = lower bitrate but lower quality. Try values in [.25, 8].

To encode to BC1 with RDO, with RDO debug output, to monitor the percentage of blocks impacted:

./bc7enc -1 -z1.0 -zd blah.png

To encode to BC1 with RDO with a higher than default smooth block scale factor:

./bc7enc -1 -z1.0 -zb40.0 blah.png

Use -zb1.0 to disable smooth block error scaling completely, which increases RDO performance but can result in noticeable artifacts on smooth/flat blocks at higher lambdas.

Use -zc# to control the RDO window size in bytes. Good values to try are 16-8192. Use -zt to disable RDO multithreading.

To encode to BC1 with RDO at the highest achievable quality/effectiveness (this is extremely slow):

./bc7enc -1 -z1.0 -zc32768 blah.png

This sets the window size to 32KB (the highest setting that makes sense for Deflate). Window sizes of 2KB (the default) to 8KB are way faster and in practice are almost as effective. The maximum window size setting supported by the command line tool is 64KB, but this would be very slow.

For even higher quality per bit (this is incredibly slow):

./bc7enc -1 -z1.0 -zc32768 -zm blah.png

Dependencies

There are no third-party code or library dependencies. utils.cpp/.h is only needed by the example command line tool. The code uses C++11. The individual .cpp files are designed to be easily dropped into other codebases.

For RDO post-processing of any block-based format: ert.cpp/.h. You provide this function an array of encoded blocks, an array of source/original 32bpp blocks, some parameters, and a pointer to a block decoder function for your format as a callback. The callback must return false if the passed-in block data is invalid. This transform should work on other texture formats, such as ETC1/2, EAC, and ASTC. The ERT works on block sizes ranging from 1x1 to 12x12. This file has no dependencies.

For BC1-5 encoding/decoding: rgbcx.cpp/.h

For BC7 encoding: bc7enc.cpp/.h

For BC7 decoding: bc7decomp.cpp/.h

Comments
  • No documentation - no idea where to start

    Hi,

    I am trying to use this to load and save from a filename

    Loading: BC7 dds input file from hard drive (with a filename) - output to RGBA raw uncompressed mem stream.

    Example: Loadbc7("c:\myfile.dds");

    Saving: RGBA raw uncompressed mem stream to BC7 dds output file (a filename - height and width need specifying to the output file) Savebc7("c:\myfile.dds", 512, 1024);

    Do you have any working examples of this?

    opened by DLPB2 1
  • MSVCP140D.dll missing

I try to compile the executable by running the .bat file and then executing in the console: cmake --build . But the compiled executable only works on my machine. On others it asks for msvcp140d.dll. Looks like it is a debug library or something. Could somebody help me with the compiling process?

    opened by rimpy-custom 1
  • Assert in bc7enc mode 6

    When compressing a texture in debug mode, I encountered an assert. The following color block was being compressed:

    23, 25, 46, 255
    22, 24, 45, 255
    22, 25, 44, 255
    22, 25, 44, 255
    23, 25, 46, 255
    22, 24, 45, 255
    22, 25, 44, 255
    22, 25, 44, 255
    22, 24, 45, 255
    22, 24, 45, 255
    22, 25, 44, 255
    22, 25, 44, 255
    22, 24, 45, 255
    22, 24, 45, 255
    22, 25, 44, 255
    22, 25, 44, 255
    

    It was processing mode 6, and the assert was in set_block_bits() (val set to 67, asserting since it doesn't fit in 3 bits)

    The parameters I used were:

    • linear weights
    • m_max_partitions: BC7ENC_MAX_PARTITIONS
    • m_uber_level: 1
    • m_try_least_squares: true
    • m_mode17_partition_estimation_filterbank: false
    opened by akb825 1
  • bc7e: Improve accuracy of some solid color blocks encoding

    Out of 16 million possible opaque solid color blocks, previous code was not encoding all of them perfectly.

    In slow,veryslow,slowest perceptual modes: 1038 colors (out of 16M) not encoded identically. In basic perceptual mode: 3056 colors (out of 16M) not encoded perfectly (most obvious being pure red: 255,0,0,255 decodes as 253,0,0,255).

Similar issues exist with alpha solid color blocks.

First I tried adding a dedicated "detect if pure solid color block, encode all the solid color modes no matter the settings" path. E.g. the default perceptual basic mode never uses modes 2, 4 or 5, and neither mode 1 nor mode 6 can encode pure opaque red, for example. Anyway, adding that path gave better results, similar to how the "slow" modes do.

    But then I added optimal endpoint tables for modes 3 and 5, similar to how other endpoint tables are done. And that's when I found that all the opaque solid color blocks can be encoded with just mode 5 perfectly. So the final code change is fairly simple, just:

    • detect if it's a solid color block (done in the same place as detection of alpha).
    • if it is a solid color block, encode using mode 5, using optimal tables. No rotation is used.
    • initialization code computes tables for mode 5 now too, which does increase initialization time a bit though (before: 39ms, now: 48ms on my machine).

    I tested all 16M possible opaque colors, and a bunch of ones with alpha too, and they all can be encoded perfectly now.

    opened by aras-p 1
  • Pass block index to unpack callback so custom constraints can be validated

    This passes the block index to the unpack callback so that the callback can check if an input-dependent constraint that the encoder was adhering to (like CVTT's strict punchthrough flag) was violated by the ERT candidate, and reject it as invalid if so.

    ASTC could also use this to reject any changes to void-extent blocks.

    opened by elasota 1
  • Encoding time incorrect on multicore machine

    I'll update this more when I've tested on other platforms. Running on Debian I'm seeing this:

    Total encoding time: 32.043268 secs
    Total processing time: 32.212762 secs
    

    But the actual run was much, much faster. If I time the run (with time) I see:

    real	0m2.503s
    user	0m57.651s
    sys	0m4.223s
    

    The 2.5 seconds of wallclock time feels correct (this is a 144-core machine).

    Update: if I change the clock() calls to the required jumble of std::chrono incantations I get:

    Total encoding time: 0.400000 secs
    Total processing time: 0.417000 secs
    

This was in milliseconds, so on a many-core or fast machine it misses the nuances. I could do a PR for this and switch to micros if this is of interest? Something like this:

    https://github.com/richgel999/bc7enc_rdo/compare/master...cwoffenden:bc7enc_rdo:mt-timer?expand=1

    (It's what I'm using to time the BC4/5 changes I made)

    opened by cwoffenden 0
  • BC4/5 fixes and performance improvements

    This fixes #17 but goes further:

Lots of text snipped, jump down to the next paragraph. Originally this expanded the internal endpoints to 14-bit, but in testing the RMSE and PSNR were always slightly worse even though the max error was reduced. These errors were higher due to being calculated from the 8-bit PNG file, not the hardware's representation. Ryg's blog entry has a good explanation of the hardware.

    I simplified this commit to address the main issue, which was blocks with two (or few) values having errors in hardware due to one endpoint always being interpolated (which doesn't occur with an 8-bit software decoder). This is achieved by starting the search radius at zero and working outwards (0, -1, 1, -2, 2, etc.). Further, once we have zero error we take this block as the best available and exit early.

    This fixes the original issue, keeps the max error, RMSE and PSNR exactly the same, and improves performance. Some timings, using the default -hr5 radius:

    Original code:

    BC4
    
    flowers-2048x2048
    Total encoding time: 0.599000 secs
    Total processing time: 0.656000 secs
    
    quenza-2048x2048
    Total encoding time: 0.825000 secs
    Total processing time: 0.883000 secs
    
    BC5
    
    bunny-nmap-2048x2048
    Total encoding time: 0.446000 secs
    Total processing time: 0.510000 secs
    
    can-nmap-2048x2048
    Total encoding time: 0.342000 secs
    Total processing time: 0.398000 secs
    

    This commit:

    BC4
    
    flowers-2048x2048
    Total encoding time: 0.476000 secs
    Total processing time: 0.534000 secs
    
    quenza-2048x2048
    Total encoding time: 0.725000 secs
    Total processing time: 0.784000 secs
    
    BC5
    
    bunny-nmap-2048x2048
    Total encoding time: 0.214000 secs
    Total processing time: 0.271000 secs
    
    can-nmap-2048x2048
    Total encoding time: 0.212000 secs
    Total processing time: 0.268000 secs
    

    All timings were from the best of four runs. The biggest improvement was in normal maps since there are large areas with 2-3 values hovering around 127, and since the search radius is now growing outwards these are found early on.

    opened by cwoffenden 4
  • BC4/5 blocks with two values per channel are slightly off

    If a BC4 block has only two values, and the search radius is greater than zero, then the second value is always interpolated, and since the BC4/5 interpolation is done with more than 8-bits in hardware (see findings here) the resulting value is slightly off.

    If you start with the code here:

    https://github.com/richgel999/bc7enc_rdo/blob/e6990bc11829c072d9f9e37296f3335072aab4e4/rgbcx.cpp#L2801

An example being: a block with values 126 and 127 and the default search radius of 3 will achieve a best_err of zero with endpoints 127 and 123 (with six interpolated values, MODE8 in the code). When interpolated as 8-bit, and with selectors of 0 and 2, this does correctly result in values of 126 and 127. But... since the hardware uses 14- or 16-bit interpolation, the resulting values are 126.43 and 127.0.

    This is small, agreed, but when mixed with solid blocks of 126 we get block artefacts as the encoder flips between solid and multi-value blocks, breaking down to blocks of 126.0 and 126.43.

This came about because we have a normal map exported from 3D Coat with essentially a dimple noise texture on it. AFAIK it was worked on at 4k then exported at 2k, by which time the dimple texture ended up being a few dots. Here's the result isolated with obvious block errors:

(screenshot: rgbcx output showing the block artifacts)

Ignoring that there's clearly something amiss with the normal map, it does nicely highlight this issue. If this were BC3 alpha it wouldn't be affected; it's only BC4/5 with the extended interpolation bits.

    The easy fix (which I've already tried) is to clamp the search radius to zero when there are only two values. I think, though, a better fix may be to interpolate with more range in bc4_block::get_block_values() so that it works for more than two values (which I'll try next).

    opened by cwoffenden 1
  • AVX and AVX2 targets perform better with i32x4 width

    The current compilation in CMakeLists.txt passes avx and avx2 as targets to ISPC. These default to avx1-i32x8 and avx2-i32x8 respectively. However, when timing conversion of a very large input image performance was significantly better when using avx1-i32x4 and avx2-i32x4. On a laptop with an i7-8750H, avx1-i32x8 was almost 2x slower than avx1-i32x4, while avx2-i32x8 was ~20% slower than avx2-i32x4. On a desktop with a Ryzen 2700X, i32x8 was almost 2x slower than i32x4 for both avx1 and avx2.

    A couple other interesting things of note:

    • On the i7, SSE4 and AVX1 performed identically.
    • On the Ryzen system, AVX2 was actually ~10% slower than AVX1, and AVX1 was very slightly (~2-3%) faster than SSE4. Some research seems to suggest that newer Ryzen models likely fare better with AVX2.

    In other words, at least with the hardware I have available to me switching to i32x4 for any AVX targets universally gives a significant performance boost. In my own project (which includes bc7enc_rdo as well as ISPCTextureCompressor for BC6H support) I switched to avx2-i32x4 for the AVX2 target and dropped the AVX1 target due to not being different enough from SSE4 to justify the extra compile time and binary size.

    Also worth noting that I saw a similar performance degradation when testing on an M1 Mac with using neon-i32x8 vs. neon-i32x4, which is what prompted me to check the difference in x86.

    opened by akb825 2
  • Constant alpha = 255 and/or color is not reproduced using BC7 transparency modes

This is related to this issue that still remains on BC7enc. I updated to BC7enc_rdo. This affects BC7 encoded textures. https://github.com/richgel999/bc7enc/issues/3

When pbit = 0 is chosen in the following call, the reconstructed alpha can only reach 254. This doesn't match the fully opaque textures that we are processing. We don't have this issue with the ETC/ASTC encoders. For now, we'll work around it at the shader level and assume 254/255.0 is opaque when BC7enc is used.

I tried forcing use of pbit = 1 for opaque textures. This results in correct fully opaque textures, but on a red-black checkerboard, the pbit of 1 results in 255,1,1,255 as the final color, when it should be 255,0,0,255. The transparent mode that is chosen seems to be incapable of reproducing the original texture accurately. The pbit limits the reproduced RGBA components - an even pbit results in even values, and an odd pbit results in odd values. So bc7enc probably needs to support one of the opaque modes, where the color bits don't affect the alpha.

    static uint64_t find_optimal_solution(uint32_t mode, vec4F xl, vec4F xh, const color_cell_compressor_params *pParams, color_cell_compressor_results *pResults)
    {
    ....
    	  for (int p = pParams->m_has_alpha ? 0 : 1; p < 2; p++)
    
    opened by alecazam 2
Owner
Rich Geldreich