BM3D denoising filter for VapourSynth, implemented in CUDA



Copyright© 2021 WolframRhodium

Please check VapourSynth-BM3D.


  • CPU with AVX support.

  • CUDA-enabled GPU(s) of compute capability 5.0 or higher.

  • GPU driver 450 or newer.

The minimum supported compute capability is 3.0, but targeting it requires manual compilation with the nvcc flag -gencode arch=compute_30,code=sm_30.

The _rtc version compiles code at runtime. It requires GPU driver 465 or newer and depends on nvrtc64_112_0.dll and nvrtc-builtins64_113.dll.


bm3dcuda[_rtc].BM3D(clip clip[, clip ref=None, float[] sigma=3.0, int[] block_step=8, int[] bm_range=9, int radius=0, int[] ps_num=2, int[] ps_range=4, bool chroma=False, int device_id=0, bool fast=True])
  • clip:
    The input clip. Must be of 32 bit float format. Each plane is denoised separately if chroma is set to False.

  • ref:
    The reference clip. Must have the same format, width, height, and number of frames as clip.
    Used in block-matching and as the reference in empirical Wiener filtering, i.e. bm3d.Final / bm3d.VFinal.

  • sigma:
    The strength of denoising for each plane.
    The strength is similar (but not strictly equal) to that of VapourSynth-BM3D due to differences in implementation (coefficient normalization is not implemented, for example).
    Default [3,3,3].

  • block_step, bm_range, radius, ps_num, ps_range:
    Same as those in VapourSynth-BM3D.
    If chroma is set to True, only the first value is in effect.
    Otherwise an array of values may be specified for each plane.

  • chroma:
    CBM3D algorithm. clip must be of YUV444PS format.
    Y channel is used in block-matching of chroma channels. Default False.

  • device_id:
    Set GPU to be used.
    Default 0.

  • fast:
    Multi-threaded copy between CPU and GPU at the expense of 4x memory consumption.
    Default True.
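
The relationship between clip and ref described above can be sketched as a two-pass call: a first pass without ref, then a second pass that uses the first result as the reference. This is only a minimal sketch. The helper name bm3d_two_pass and its sigma_basic / sigma_final arguments are hypothetical, core is assumed to be a VapourSynth core with the plugin loaded, and clip is assumed to already be 32 bit float.

```python
def bm3d_two_pass(core, clip, sigma_basic=3.0, sigma_final=3.0, **kwargs):
    """Hypothetical helper: basic estimate followed by a ref-guided pass.

    Assumes `core` exposes the bm3dcuda namespace and `clip` is 32 bit float.
    Extra keyword arguments (block_step, radius, ...) go to both passes.
    """
    # First pass: no ref is given, so the result serves as the basic estimate.
    basic = core.bm3dcuda.BM3D(clip, sigma=sigma_basic, **kwargs)
    # Second pass: the basic estimate is used in block-matching and as the
    # reference in empirical Wiener filtering.
    return core.bm3dcuda.BM3D(clip, ref=basic, sigma=sigma_final, **kwargs)
```

Because the two passes are separate calls, each one could also be given its own parameters (for example a different block_step) by calling BM3D directly instead of going through the helper.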


  • bm3d.VAggregate should be called after temporal filtering, as in VapourSynth-BM3D.
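
As an illustration of the note above, a temporal (radius > 0) call followed by aggregation might look like the sketch below. The helper name vbm3d is hypothetical, core is assumed to be a VapourSynth core with both bm3dcuda and VapourSynth-BM3D loaded, and sample=1 is assumed to request 32 bit float output from bm3d.VAggregate, as documented in VapourSynth-BM3D.

```python
def vbm3d(core, clip, sigma=3.0, radius=1):
    """Hypothetical helper: temporal BM3D followed by bm3d.VAggregate.

    Assumes `clip` is 32 bit float and that VapourSynth-BM3D is installed.
    """
    flt = core.bm3dcuda.BM3D(clip, sigma=sigma, radius=radius)
    if radius > 0:
        # Temporal filtering produces intermediate data that must be
        # aggregated with the same radius; sample=1 asks for float output.
        flt = core.bm3d.VAggregate(flt, radius=radius, sample=1)
    return flt
```

With radius=0 the helper returns the spatial result directly and VAggregate is not called.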


GPU memory consumption:
(ref ? 4 : 3) * (chroma ? 3 : 1) * (fast ? 4 : 1) * (2 * radius + 1) * size_of_a_single_frame
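
The formula can be evaluated with a small helper to estimate memory use before running a script. This is a sketch under one assumption not stated above: size_of_a_single_frame is taken to be width * height * 4 bytes, i.e. one plane of 32 bit float samples. The function name gpu_memory_bytes is hypothetical.

```python
def gpu_memory_bytes(width, height, radius=0, ref=False, chroma=False,
                     fast=True, bytes_per_sample=4):
    """Estimate GPU memory use in bytes from the formula above.

    Assumption: a "single frame" is one plane of 32 bit float samples.
    """
    frame_size = width * height * bytes_per_sample
    return ((4 if ref else 3)
            * (3 if chroma else 1)
            * (4 if fast else 1)
            * (2 * radius + 1)
            * frame_size)


# 1920x1080 with default settings: 3 * 1 * 4 * 1 * 8294400 bytes
print(gpu_memory_bytes(1920, 1080))  # 99532800
```

Under this assumption, setting fast=False divides the estimate by 4, and each increment of radius adds two more frames' worth of buffers.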

Compilation on Linux

Standard version

  • g++ 11 (or higher) is required to compile source.cpp, while nvcc 11.3 only supports g++ 10 or older.

  • Unused nvcc flags may be removed. See the nvcc documentation for -gencode.

cd source

nvcc kernel.cu -o kernel.o -c --use_fast_math --std=c++17 -gencode arch=compute_50,code=\"sm_50,compute_50\" -gencode arch=compute_52,code=sm_52 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_86,code=\"sm_86,compute_86\" -t 0 --compiler-bindir g++-10

g++-11 source.cpp kernel.o -o libbm3dcuda.so -shared -fPIC -I/usr/local/cuda-11.3/include -I/usr/local/include -L/usr/local/cuda-11.3/lib64 -lcudart_static --std=c++20 -march=native -O3
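
Since unused -gencode flags may be removed, the flag list can be generated for just the GPUs being targeted. The helper below is a hypothetical convenience, not part of the build system. It emits only the plain sm_XX form and omits the embedded PTX variant (code="sm_XX,compute_XX") that the nvcc command above uses for compute capabilities 5.0 and 8.6.

```python
def gencode_flags(capabilities):
    """Build nvcc -gencode arguments for the given compute capabilities.

    `capabilities` holds integers such as 61 for compute capability 6.1.
    Hypothetical helper; PTX embedding is intentionally left out.
    """
    flags = []
    for cc in capabilities:
        flags += ["-gencode", f"arch=compute_{cc},code=sm_{cc}"]
    return flags


# Flags for a machine with only a compute capability 6.1 GPU
print(" ".join(gencode_flags([61])))
```

The resulting arguments would replace the -gencode entries in the nvcc command above.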

RTC version

cd rtc_source

g++-11 source.cpp -o libbm3dcuda_rtc.so -shared -fPIC -I /usr/local/cuda-11.3/include -I /usr/local/include -L /usr/local/cuda-11.3/lib64 -lnvrtc -lcuda -Wl,-rpath,/usr/local/cuda-11.3/lib64 --std=c++20 -march=native -O3

  • Apparently RC3-rtc build crashes

    I just tried to use the rtc build and it crashes vsedit silently. gdb says that it's a segfault.

    Starting program: C:\vs-r52.1-port\vsedit.exe
    [New Thread 17896.0x4fc8]
    [New Thread 17896.0x2a78]
    Thread 1 received signal SIGSEGV, Segmentation fault.
    0x00007ffeb5796984 in ?? ()
  • Deterministic output

    Greetings, thank you for this plugin.

    I have noticed that this filter has some form of nondeterminism; unfortunately, this complicates my workflow. Any chance this is something that could be looked into?

    Linux 5.12.8 CUDA 11.3.0-2 Nvidia 465.31 GTX 1080

  • Reducing blocking?

    During some simple tests, I noticed that a higher block_step produces more blocking when sigma is high, and block_step=8 causes some visible blocking even when sigma is relatively low (~5).

    As mawen mentioned in this issue, using a smaller block_step in the final estimate reduces some artifacts (including blockiness).

    Is it possible to do this in this implementation?

  • Backport bm3dcpu to use C++17 only

    C++20 might be too recent as its support is not yet widespread.

    This PR backports it to be usable with only C++17.

    Also replaced the use of a deprecated VS API.

  • Lacking group_size parameter

    It can sometimes improve denoising quality; please consider implementing it. BTW, is there any technical difficulty in making it adjustable? There should be a similar value internally, right?

  • ParallelFilter race condition

    Sometimes, when using BM3DCUDA along with other CUDA filters, I run into the following error:

    x265 [WARN]: detected ParallelFilter race condition on last row

    I have read in a few threads on the web that it usually happens with CUDA.

    For example, it happens with this script:

    SetFilterMTMode("DEFAULT_MT_MODE", 2)
    DGSource("F:\In\Cowboy Bebop\26.dgi",ct=0,cb=0,cl=236,cr=236)
    DGTelecide(mode=1, pthresh=3.5)
    BM3D_CUDA(sigma=3, radius=2)
    fmtc_bitdepth (bits=10,dmode=8)
    neo_f3kdb(range=15, Y=65, Cb=40, Cr=40, grainY=0, grainC=0, sample_mode=2, blur_first=true, dynamic_grain=false, mt=false, keep_tv_range=true)

    Is there anything that you can do?

  • Linux Build Fails from AUR

    When building from AUR, I get this build error:

    ==> Starting build()...
    -- The CXX compiler identification is GNU 11.1.0
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Check for working CXX compiler: /usr/bin/c++ - skipped
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    CMake Error at /usr/share/cmake/Modules/CMakeDetermineCUDACompiler.cmake:179 (message):
      Failed to find nvcc.
      Compiler requires the CUDA toolkit.  Please set the CUDAToolkit_ROOT
    Call Stack (most recent call first):
      source/CMakeLists.txt:3 (project)
  • Possible lower than 32 bit??

    Your plugin is very fast, but only when BM3D is used on its own. When BM3D is used to build a mask for another denoiser (and that denoiser only has a CPU version), it is very slow because BM3D only supports 32 bit input. If you added support for bit depths lower than 32 bit, I think it would be faster.

  • RGB2OPP with radius?

    With radius=0, I can give an RGB clip that is automatically converted to OPP format. With radius=1, it fails. If I try to convert manually with RGB2OPP, it throws: "Python exception: BM3D_RTC: only constant format 32bit float input supported"

    Is it possible to run in OPP format with radius=1?

