BM3D denoising filter for VapourSynth, implemented in CUDA

Overview

VapourSynth-BM3DCUDA

Copyright © 2021 WolframRhodium

Description

For a description of the algorithm, see VapourSynth-BM3D.

Requirements

  • CPU with AVX support.

  • CUDA-enabled GPU(s) of compute capability 5.0 or higher.

  • GPU driver 450 or newer.

Compute capability as low as 3.0 is supported, but only through manual compilation (specify the nvcc flag -gencode arch=compute_30,code=sm_30).

The _rtc version compiles code at runtime. It requires GPU driver 465 or newer and has dependencies on nvrtc64_112_0.dll/libnvrtc.so.11.2 and nvrtc-builtins64_113.dll/libnvrtc-builtins.so.11.3.109.

Parameters

bm3dcuda[_rtc].BM3D(clip clip[, clip ref=None, float[] sigma=3.0, int[] block_step=8, int[] bm_range=9, int radius=0, int[] ps_num=2, int[] ps_range=4, bint chroma=False, int device_id=0, bool fast=True])
  • clip:
    The input clip. Must be in 32-bit float format. Each plane is denoised separately if chroma is set to False.

  • ref:
    The reference clip. Must have the same format, width, height, and number of frames as clip.
    Used in block-matching and as the reference in empirical Wiener filtering, i.e. bm3d.Final / bm3d.VFinal.

  • sigma:
    The strength of denoising for each plane.
    The strength is similar (but not strictly equal) to that of VapourSynth-BM3D because of differences in implementation; for example, coefficient normalization is not implemented.
    Default [3,3,3].

  • block_step, bm_range, radius, ps_num, ps_range:
    Same as those in VapourSynth-BM3D.
    If chroma is set to True, only the first value is in effect.
    Otherwise an array of values may be specified for each plane.

  • chroma:
    Enables the CBM3D algorithm. clip must be in YUV444PS format.
    The Y channel is used in block-matching of the chroma channels. Default False.

  • device_id:
    Index of the GPU to use.
    Default 0.

  • fast:
    Enables multi-threaded copies between CPU and GPU at the expense of 4x memory consumption.
    Default True.

Notes

  • When temporal filtering is used (radius > 0), bm3d.VAggregate should be called on the result, as in VapourSynth-BM3D.
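As a rough illustration of the parameters and the note above, the sketch below wraps the filter call in a helper. It assumes the plugin is loaded under the bm3dcuda namespace and that VapourSynth-BM3D (bm3d) is installed as well. The helper name denoise is hypothetical, and the VAggregate arguments (radius, sample=1) mirror a user-submitted script rather than anything verified here.

```python
def denoise(core, clip, sigma=3.0, radius=0, **kwargs):
    """Denoise a 32-bit float clip with bm3dcuda (hypothetical helper).

    `core` is the VapourSynth core object. It is passed in so that this
    sketch has no import-time dependency on VapourSynth itself.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")

    # Spatial (radius == 0) or temporal (radius > 0) BM3D on the GPU.
    out = core.bm3dcuda.BM3D(clip, sigma=sigma, radius=radius, **kwargs)

    # Temporal filtering must be followed by VapourSynth-BM3D's VAggregate.
    # Assumption: sample=1 selects 32-bit float output, as in VapourSynth-BM3D.
    if radius > 0:
        out = out.bm3d.VAggregate(radius=radius, sample=1)
    return out
```

In a script this might be called as denoise(vs.core, clip, sigma=3.0, radius=1) after import vapoursynth as vs, with clip already converted to a 32-bit float format.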

Statistics

GPU memory consumption:
(ref ? 4 : 3) * (chroma ? 3 : 1) * (fast ? 4 : 1) * (2 * radius + 1) * size_of_a_single_frame
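The formula can be evaluated with a few lines of Python. This is only a sketch: the function name gpu_memory_bytes is hypothetical, and it assumes that size_of_a_single_frame means one 32-bit float plane of width * height samples with no stride padding, which the text above does not state.

```python
def gpu_memory_bytes(width, height, radius=0, ref=False, chroma=False, fast=True):
    """Estimate GPU memory use from the formula in this section."""
    # Assumption: one "frame" is a single 32-bit float plane with no padding.
    frame_bytes = width * height * 4
    return (
        (4 if ref else 3)
        * (3 if chroma else 1)
        * (4 if fast else 1)
        * (2 * radius + 1)
        * frame_bytes
    )


if __name__ == "__main__":
    # 1920x1080 plane with the default parameters.
    estimate = gpu_memory_bytes(1920, 1080)
    print(f"{estimate / 2**20:.1f} MiB")
```

Under these assumptions a 1920x1080 plane with default parameters comes to roughly 95 MiB, and setting fast=False cuts that to a quarter.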

Compilation on Linux

Standard version

  • g++ 11 (or higher) is required to compile source.cpp, while nvcc 11.3 only supports g++ 10 or older.

  • Unused nvcc flags may be removed; see the nvcc documentation for -gencode.

cd source

nvcc kernel.cu -o kernel.o -c --use_fast_math --std=c++17 -gencode arch=compute_50,code=\"sm_50,compute_50\" -gencode arch=compute_52,code=sm_52 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_86,code=\"sm_86,compute_86\" -t 0 --compiler-bindir g++-10

g++-11 source.cpp kernel.o -o libbm3dcuda.so -shared -fPIC -I/usr/local/cuda-11.3/include -I/usr/local/include -L/usr/local/cuda-11.3/lib64 -lcudart_static --std=c++20 -march=native -O3
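Because unused -gencode flags may be dropped, a small helper can generate only the flags for the GPUs being targeted. The sketch below is illustrative: the function name gencode_flags is hypothetical, and it emits the plain arch=compute_XY,code=sm_XY form shown in the Requirements section rather than the combined sm/compute form used in the command above.

```python
def gencode_flags(capabilities):
    """Build nvcc -gencode flags for compute capabilities such as "7.5"."""
    flags = []
    for cap in capabilities:
        major, minor = cap.split(".")
        arch = f"{int(major)}{int(minor)}"
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return flags


if __name__ == "__main__":
    # Example: build only for a GTX 1080 (6.1) and an RTX 2070 Super (7.5).
    print(" ".join(gencode_flags(["6.1", "7.5"])))
```

The resulting string can be pasted into the nvcc command in place of the full list of -gencode flags.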

RTC version

cd rtc_source

g++-11 source.cpp -o libbm3drtc.so -shared -fPIC -I /usr/local/cuda-11.3/include -I /usr/local/include -L /usr/local/cuda-11.3/lib64 -lnvrtc -lcuda -Wl,-rpath,/usr/local/cuda-11.3/lib64 --std=c++20 -march=native -O3

Issues
  • Apparently RC3-rtc build crashes

    I just tried the rtc build and it crashes vsedit silently. gdb reports a segfault.

    Starting program: C:\vs-r52.1-port\vsedit.exe
    [New Thread 17896.0x4fc8]
    ...
    [New Thread 17896.0x2a78]
    
    Thread 1 received signal SIGSEGV, Segmentation fault.
    0x00007ffeb5796984 in ?? ()
    
    opened by DJATOM 12
  • Deterministic output

    Greetings, thank you for this plugin.

    I have noticed that this filter has some form of nondeterminism, which unfortunately complicates my workflow. Is there any chance this could be looked into?

    Linux 5.12.8, CUDA 11.3.0-2, Nvidia 465.31, GTX 1080

    opened by NSQY 7
  • Reducing blocking?

    During some simple tests, I noticed that higher values of block_step produce more blocking when sigma is high, and that block_step=8 causes some visible blocking even when sigma is relatively low (~5).

    As mawen mentioned in this issue: https://github.com/HomeOfVapourSynthEvolution/VapourSynth-BM3D/issues/18#issuecomment-233107735, using a smaller block_step in the final estimate reduces some artifacts (including blockiness).

    Is it possible to do this in this implementation?

    opened by Mr-Z-2697 5
  • Backport bm3dcpu to use C++17 only

    C++20 might be too recent as its support is not yet widespread.

    This PR backports it to be usable with only C++17.

    Also replaced the use of a deprecated VS API.

    opened by AkarinVS 1
  • ParallelFilter race condition

    Sometimes, when using BM3DCUDA along with other CUDA filters, I run into the following error:

    x265 [WARN]: detected ParallelFilter race condition on last row

    From a few scattered threads on the web, I gather that this usually happens with CUDA.

    For example, with this script:

    SetMemoryMax()
    SetFilterMTMode("DEFAULT_MT_MODE", 2)
    LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
    DGSource("F:\In\Cowboy Bebop\26.dgi",ct=0,cb=0,cl=236,cr=236)
    DGTelecide(mode=1, pthresh=3.5)
    DGDecimate()
    ConvertBits(32)
    BM3D_CUDA(sigma=3, radius=2)
    BM3D_VAggregate(radius=2)
    fmtc_bitdepth (bits=10,dmode=8)
    neo_f3kdb(range=15, Y=65, Cb=40, Cr=40, grainY=0, grainC=0, sample_mode=2, blur_first=true, dynamic_grain=false, mt=false, keep_tv_range=true)
    Prefetch(8)
    

    Is there anything that you can do?

    opened by tormento 6
  • Linux Build Fails from AUR

    When building from AUR, I get this build error:

    ==> Starting build()...
    -- The CXX compiler identification is GNU 11.1.0
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Check for working CXX compiler: /usr/bin/c++ - skipped
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    CMake Error at /usr/share/cmake/Modules/CMakeDetermineCUDACompiler.cmake:179 (message):
      Failed to find nvcc.
    
      Compiler requires the CUDA toolkit.  Please set the CUDAToolkit_ROOT
      variable.
    Call Stack (most recent call first):
      source/CMakeLists.txt:3 (project)
    
    opened by mysteryx93 11
  • Possible lower than 32 bit??

    Your plugin is very fast, but only when BM3D is used on its own. When BM3D is used to build a mask for another denoiser (one that only has a CPU version), it is very slow because BM3D only supports 32-bit input. If you added support for bit depths lower than 32, I think it would be faster.

    opened by kedaitinh12 8
  • RGB2OPP with radius?

    With radius=0, I can pass an RGB clip and it is converted to OPP format automatically. With radius=1, it fails. If I try to convert manually with RGB2OPP, it throws: "Python exception: BM3D_RTC: only constant format 32bit float input supported"

    Is it possible to run in OPP format with radius=1?

    opened by mysteryx93 3
  • BM3D Hangs For Non-power-of-2 block_step Values on RTX 2070 Super

    The following script hangs for me on an RTX 2070 Super when block_step=3, 5, 6, or 7. It would seem that any value that is not a power of 2 is problematic:

    import vapoursynth as vs
    core = vs.get_core()
    
    # Easier debug
    core.num_threads = 1
    
    clip = core.std.BlankClip(width=720, height=480, format=vs.YUV444PS)
    
    # Radius doesn't seem to matter
    bm3d_r = 1
    
    # cudaStreamSynchronize stops returning when block_step=3, 5, 6, or 7 (1, 2, 4 and 8 are fine)
    # fast=False for debug
    vbasic = core.bm3dcuda.BM3D(clip, radius=bm3d_r, block_step=3, fast=False).bm3d.VAggregate(radius=bm3d_r, sample=1)
    
    vbasic.set_output(0)
    

    This same script works fine for all values 1-8 on a GTX 970 in the same machine.

    I'm new to CUDA but did manage to do some debugging. Hopefully it's useful.

    When hung I see that the CPU is waiting on cudaStreamSynchronize() in BM3DGetFrame(); apparently waiting for the GPU to finish. Inspection with cuda-gdb shows that at least some of the CUDA threads are waiting on __shfl_xor_sync() at kernel.cu:459 while some others are waiting at kernel.cu:516 for __shfl_sync().

    (cuda-gdb) where         
    #0  0x0000555559715ad0 in __cuda_sm70_shflsync_idx_p ()
    #1  0x000055555976f860 in _INTERNAL_53_tmpxft_00005f71_00000000_11_kernel_compute_86_cpp1_ii_d1368b98::__shfl_sync (mask=4294967295, 
        var=705, srcLane=0, width=8) at /usr/local/cuda-11.4/bin/../targets/x86_64-linux/include/sm_30_intrinsics.hpp:373
    #2  0x000055555977ce20 in bm3d<true, true, true><<<(60,160,1),(32,1,1)>>> (res=0x7ffef8000000, src=0x7ffefe000000, width=720, height=480, 
        stride=768, sigma=1.69411767, block_step=3, bm_range=12, _radius=4, ps_num=2, ps_range=6, sigma_u=0.318892747, sigma_v=0.0600268729, 
        extractor=8) at kernel.cu:516
    ...
    (cuda-gdb) where         
    #0  0x0000555559753ad0 in __cuda_sm70_shflsync_bfly_p ()
    #1  0x0000555559774500 in _INTERNAL_53_tmpxft_00005f71_00000000_11_kernel_compute_86_cpp1_ii_d1368b98::__shfl_xor_sync (mask=4294967295, 
        var=7.70065981e-07, laneMask=1, width=8) at /usr/local/cuda-11.4/bin/../targets/x86_64-linux/include/sm_30_intrinsics.hpp:449
    #2  0x000055555977b530 in bm3d<true, true, true><<<(60,160,1),(32,1,1)>>> (res=0x7ffef8000000, src=0x7ffefe000000, width=720, height=480, 
        stride=768, sigma=1.69411767, block_step=3, bm_range=12, _radius=4, ps_num=2, ps_range=6, sigma_u=0.318892747, sigma_v=0.0600268729, 
        extractor=8) at kernel.cu:459
    

    Experimentation reveals that compiling the CUDA kernel using arch=compute_60,code=sm_75 causes the problem to go away (as was recommended for diagnosis here https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/). The problem returns when arch is changed to compute_70. Additionally, modifying kernel.cu to use __activemask(), rather than the membermask variable, when __CUDA_ARCH__ >= 700 also causes the problem to stop.

    Possibly related: in cuda-gdb, fewer than 32 threads appear to be active in the hung warps:

    (cuda-gdb) info cuda warps
     Wp Active Lanes Mask Divergent Lanes Mask Active Physical PC Kernel   BlockIdx First Active ThreadIdx 
    Device 0 SM 19
      0        0x00000000           0x00000000                n/a    n/a        n/a                    n/a 
      1        0x00ffffff           0xff000000 0x00000000000000d0      0  (58,12,0)                (0,0,0) 
    * 2        0x00ffffff           0xff000000 0x00000000000000d0      0   (58,0,0)                (0,0,0) 
      3        0x00000000           0x00000000                n/a    n/a        n/a                    n/a 
      4        0x00ffffff           0xff000000 0x00000000000000d0      0   (58,4,0)                (0,0,0) 
      5        0x00000000           0x00000000                n/a    n/a        n/a                    n/a 
      6        0x00ffffff           0xff000000 0x00000000000000d0      0 (58,106,0)                (0,0,0)
    

    But the membermask value in these threads is -1. It would seem there's a problem in the calculation of membermask but I've not yet worked out what the calculation should be.

    Please let me know what other info or assistance I may provide in diagnosing this.

    Thanks for your work on this project!

    Config info:
    NVIDIA driver version: 460.91.03
    CUDA version: 11.2
    nvcc version: 11.4.100

    Operating system: Linux Mint 19
    Kernel: Linux 5.4.0-80-generic
    Architecture: x86-64

    VapourSynth Video Processing Library
    Copyright (c) 2012-2020 Fredrik Mellbin
    Core R52
    API R3.6
    Options: -

    opened by webghost009 6
Releases (R2.8)