BM3D denoising filter for VapourSynth, implemented in CUDA



Copyright© 2021 WolframRhodium


For a description of the algorithm and its parameters, please see VapourSynth-BM3D.


Requirements:

  • CPU with AVX support.

  • CUDA-enabled GPU(s) of compute capability 5.0 or higher.

  • GPU driver 450 or newer.

The absolute minimum compute capability is 3.0, but targeting it requires manual compilation (specifying the nvcc flag -gencode arch=compute_30,code=sm_30).

The _rtc version compiles code at runtime. It requires GPU driver 465 or newer and has dependencies on nvrtc64_112_0.dll/ and nvrtc-builtins64_113.dll/


bm3dcuda[_rtc].BM3D(clip clip[, clip ref=None, float[] sigma=3.0, int[] block_step=8, int[] bm_range=9, int radius=0, int[] ps_num=2, int[] ps_range=4, bint chroma=False, int device_id=0, bool fast=True])
  • clip:
    The input clip. Must be in 32-bit float format. Each plane is denoised separately if chroma is set to False.

  • ref:
    The reference clip. Must have the same format, width, height, and number of frames as clip.
    Used in block-matching and as the reference in empirical Wiener filtering, i.e. bm3d.Final / bm3d.VFinal.

  • sigma:
    The strength of denoising for each plane.
    The strength is similar (but not strictly equal) to that of VapourSynth-BM3D due to differences in implementation (coefficient normalization is not implemented, for example).
    Default [3,3,3].

  • block_step, bm_range, radius, ps_num, ps_range:
    Same as those in VapourSynth-BM3D.
    If chroma is set to True, only the first value is in effect.
    Otherwise an array of values may be specified for each plane.

  • chroma:
    Enables the CBM3D algorithm. clip must be in YUV444PS format.
    The Y channel is used in block-matching of the chroma channels. Default False.

  • device_id:
    Index of the GPU to be used.
    Default 0.

  • fast:
    Use multi-threaded copies between CPU and GPU, at the expense of 4x memory consumption.
    Default True.
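
As a rough illustration of how the parameters above fit together, here is a minimal Python sketch of a two-pass call: a first pass without ref, followed by a second pass that takes the first result as ref. The helper name bm3d_two_stage and the sigma values are hypothetical; core is assumed to be a VapourSynth core with the bm3dcuda plugin loaded, and clip is assumed to already be in 32-bit float format.

```python
def bm3d_two_stage(core, clip, sigma=(3.0, 3.0, 3.0), chroma=False):
    """Hypothetical helper: one pass without ref, then one pass that uses it as ref.

    core  -- assumed to be a VapourSynth core with bm3dcuda loaded
    clip  -- assumed to be a 32-bit float clip (YUV444PS if chroma=True)
    sigma -- per-plane strength; illustrative values, not a recommendation
    """
    sigma = list(sigma)
    # First pass: no ref is given.
    basic = core.bm3dcuda.BM3D(clip, sigma=sigma, chroma=chroma)
    # Second pass: the first result is passed as ref, so it is used in
    # block-matching and as the reference in empirical Wiener filtering.
    return core.bm3dcuda.BM3D(clip, ref=basic, sigma=sigma, chroma=chroma)
```

In a script this might be called as bm3d_two_stage(vs.core, src) once src has been converted to 32-bit float.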


  • bm3d.VAggregate should be called after temporal filtering, as in VapourSynth-BM3D.
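
A minimal Python sketch of the note above, assuming core is a VapourSynth core with both bm3dcuda and VapourSynth-BM3D (namespace bm3d) loaded. The helper name bm3d_temporal is hypothetical, and sample=1 (32-bit float output from VAggregate) is an assumption taken from the VapourSynth-BM3D documentation rather than from this README.

```python
def bm3d_temporal(core, clip, sigma=3.0, radius=1):
    """Hypothetical helper: BM3D with an optional temporal radius.

    core -- assumed to be a VapourSynth core with bm3dcuda and bm3d loaded
    clip -- assumed to be a 32-bit float clip
    """
    filtered = core.bm3dcuda.BM3D(clip, sigma=sigma, radius=radius)
    if radius > 0:
        # Temporal filtering must be followed by aggregation.
        # sample=1 is assumed to select 32-bit float output.
        filtered = core.bm3d.VAggregate(filtered, radius=radius, sample=1)
    return filtered
```

With radius=0 the helper returns the BM3D output directly and VAggregate is never called.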


GPU memory consumption:
(ref ? 4 : 3) * (chroma ? 3 : 1) * (fast ? 4 : 1) * (2 * radius + 1) * size_of_a_single_frame
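
The formula can be evaluated with a small Python sketch like the one below. The function name estimate_gpu_memory is hypothetical, and the example frame size (one 1920x1080 plane of 32-bit floats) is only an assumption about what "a single frame" means here.

```python
def estimate_gpu_memory(frame_bytes, *, ref=False, chroma=False, fast=True, radius=0):
    """Evaluate the memory formula for a given single-frame size in bytes."""
    factor = (
        (4 if ref else 3)
        * (3 if chroma else 1)
        * (4 if fast else 1)
        * (2 * radius + 1)
    )
    return factor * frame_bytes


# Assumption: "a single frame" is one 1920x1080 plane of 32-bit floats.
frame_bytes = 1920 * 1080 * 4

print(estimate_gpu_memory(frame_bytes))                                   # 99532800
print(estimate_gpu_memory(frame_bytes, ref=True, chroma=True, radius=1))  # 1194393600
```

Under that assumption, the default settings need roughly 95 MiB, while ref, chroma and radius=1 together need roughly 1.1 GiB.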

Compilation on Linux

Standard version

  • g++ 11 (or higher) is required to compile source.cpp, while nvcc 11.3 only supports g++ 10 or older.

  • Unused nvcc flags may be removed; see the nvcc documentation for -gencode.

cd source

nvcc kernel.cu -o kernel.o -c --use_fast_math --std=c++17 -gencode arch=compute_50,code=\"sm_50,compute_50\" -gencode arch=compute_52,code=sm_52 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_86,code=\"sm_86,compute_86\" -t 0 --compiler-bindir g++-10

g++-11 source.cpp kernel.o -o libbm3dcuda.so -shared -fPIC -I/usr/local/cuda-11.3/include -I/usr/local/include -L/usr/local/cuda-11.3/lib64 -lcudart_static --std=c++20 -march=native -O3

RTC version

cd rtc_source

g++-11 source.cpp -o libbm3dcuda_rtc.so -shared -fPIC -I /usr/local/cuda-11.3/include -I /usr/local/include -L /usr/local/cuda-11.3/lib64 -lnvrtc -lcuda -Wl,-rpath,/usr/local/cuda-11.3/lib64 --std=c++20 -march=native -O3
  • Apparently RC3-rtc build crashes

    I just tried the rtc build and it silently crashes vsedit. gdb reports a segfault:

    Starting program: C:\vs-r52.1-port\vsedit.exe
    [New Thread 17896.0x4fc8]
    [New Thread 17896.0x2a78]
    Thread 1 received signal SIGSEGV, Segmentation fault.
    0x00007ffeb5796984 in ?? ()
    opened by DJATOM 12
  • Deterministic output

    Greetings, thank you for this plugin.

    I have noticed that this filter has some form of nondeterminism; unfortunately, this complicates my workflow. Is there any chance this could be looked into?

    Linux 5.12.8 CUDA 11.3.0-2 Nvidia 465.31 GTX 1080

    opened by NSQY 7
  • Reducing blocking?

    During some simple tests, I noticed that a higher block_step produces more blocking when sigma is high, and that block_step=8 causes some visible blocking even when sigma is relatively low (~5).

    As mawen mentioned in another issue, using a smaller block_step in the final estimate will reduce some artifacts (including blockiness).

    Is it possible to do this in this implementation?

    opened by Mr-Z-2697 5
  • Backport bm3dcpu to use C++17 only

    C++20 might be too recent as its support is not yet widespread.

    This PR backports it to be usable with only C++17.

    Also replaced the use of a deprecated VS API.

    opened by AkarinVS 1
  • Lacking group_size parameter

    It can sometimes improve denoising quality, so please consider implementing it. By the way, is there any technical difficulty in making it adjustable? There should be a similar value internally, right?

    opened by Mr-Z-2697 6
  • ParallelFilter race condition

    Sometimes, when using BM3DCUDA along with other CUDA filters, I get the following error:

    x265 [WARN]: detected ParallelFilter race condition on last row

    I have read in a few rare threads on the web that it usually happens with CUDA.

    As an example, when using the script:

    SetFilterMTMode("DEFAULT_MT_MODE", 2)
    DGSource("F:\In\Cowboy Bebop\26.dgi",ct=0,cb=0,cl=236,cr=236)
    DGTelecide(mode=1, pthresh=3.5)
    BM3D_CUDA(sigma=3, radius=2)
    fmtc_bitdepth (bits=10,dmode=8)
    neo_f3kdb(range=15, Y=65, Cb=40, Cr=40, grainY=0, grainC=0, sample_mode=2, blur_first=true, dynamic_grain=false, mt=false, keep_tv_range=true)

    Is there anything that you can do?

    opened by tormento 6
  • Linux Build Fails from AUR

    When building from AUR, I get this build error:

    ==> Starting build()...
    -- The CXX compiler identification is GNU 11.1.0
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Check for working CXX compiler: /usr/bin/c++ - skipped
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    CMake Error at /usr/share/cmake/Modules/CMakeDetermineCUDACompiler.cmake:179 (message):
      Failed to find nvcc.
      Compiler requires the CUDA toolkit.  Please set the CUDAToolkit_ROOT
    Call Stack (most recent call first):
      source/CMakeLists.txt:3 (project)
    opened by mysteryx93 11
  • Possible lower than 32 bit??

    Your plugin is very fast, but only when BM3D is used on its own. When BM3D is used as a mask for another denoiser (one that only has a CPU version), it is very slow because BM3D only supports 32-bit input. If you added support for bit depths lower than 32 bit, I think it would be faster.

    opened by kedaitinh12 8
  • RGB2OPP with radius?

    With radius=0, I can pass an RGB clip, which is automatically converted to OPP format. With radius=1, it fails. If I try to convert manually with RGB2OPP, it throws: "Python exception: BM3D_RTC: only constant format 32bit float input supported"

    Is it possible to run in OPP format with radius=1?

    opened by mysteryx93 3