cudnn_frontend provides a C++ wrapper for the cuDNN backend API, along with samples showing how to use it

Overview

cuDNN Frontend API

Introduction

The cuDNN Frontend API is a C++ header-only library that demonstrates how to use the cuDNN C backend API. The cuDNN C backend API is documented in the cuDNN developer guide.

Usage

To use the entire library, include the single header cudnn_frontend.h in your compilation unit.
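
For example, a translation unit that builds a single tensor descriptor needs only that one include. The sketch below is illustrative; the dimensions, strides, and the uid 'x' are arbitrary values chosen for the example:

    #include <iostream>
    #include <cudnn_frontend.h>

    int main() {
        // Describe an 8x32x64x64 NCHW float tensor (packed strides).
        int64_t dim[4]    = {8, 32, 64, 64};
        int64_t stride[4] = {32 * 64 * 64, 64 * 64, 64, 1};
        auto tensor = cudnn_frontend::TensorBuilder()
                          .setDim(4, dim)
                          .setStride(4, stride)
                          .setId('x')        // uid referenced later in the variant pack
                          .setAlignment(16)  // 16B alignment is needed for tensor core engines
                          .setDataType(CUDNN_DATA_FLOAT)
                          .build();
        std::cout << tensor.describe() << std::endl;
        return 0;
    }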

Organization

Each cudnnBackendDescriptorType_t documented in the backend enum is organized into its own header file:

  • cudnn_frontend_Tensor.h -> CUDNN_BACKEND_TENSOR_DESCRIPTOR
  • cudnn_frontend_ConvDesc.h -> CUDNN_BACKEND_CONVOLUTION_DESCRIPTOR
  • cudnn_frontend_PointWiseDesc.h -> CUDNN_BACKEND_POINTWISE_DESCRIPTOR
  • cudnn_frontend_Operation.h -> CUDNN_BACKEND_OPERATION_*_DESCRIPTOR
  • cudnn_frontend_OperationGraph.h -> CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR
  • cudnn_frontend_Heuristics.h -> CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR
  • cudnn_frontend_Engine.h -> CUDNN_BACKEND_ENGINE_DESCRIPTOR
  • cudnn_frontend_EngineConfig.h -> CUDNN_BACKEND_ENGINECFG_DESCRIPTOR
  • cudnn_frontend_ExecutionPlan.h -> CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR
  • cudnn_frontend_VariantPack.h -> CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR

Utility Functions

  • cudnn_frontend_find_plan.h -> Implements the cudnnFindPlan function
  • cudnn_frontend_get_plan.h -> Implements the cudnnGetPlan function
  • cudnn_frontend_Filters.h -> List of helpful utility functions to filter out execution plans

Error Handling

  • cudnn_frontend_utils.h

Fallback Lists

  • cudnn_frontend_EngineFallbackList.h -> Provides a fallback engine id if the heuristics do not provide an executable engine.
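
A minimal sketch of querying the fallback list, assuming an already-built operation graph named opGraph. The builder and accessor names follow cudnn_frontend_EngineFallbackList.h as used in the samples; treat the exact calls as illustrative:

    // Build a list of fallback engine configs for a conv-forward graph,
    // to be tried when the heuristics return nothing executable.
    auto fallback = cudnn_frontend::EngineFallbackListBuilder()
                        .setOperationGraph(opGraph)
                        .setOperation(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR)
                        .build();
    auto &fallback_configs = fallback.getFallbackList();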

Samples

Multiple samples of convolution, dgrad, wgrad, and convBiasAct are provided in samples/test_list.cpp and samples/conv_sample.cpp.

Sample tests are written using the Catch2 C++ test framework.

How to build samples:

 - CUDA_PATH points to the CUDA installation.
    - Include files are in CUDA_PATH/include
    - Link files are in CUDA_PATH/lib64
 - CUDNN_WRAP_PATH points to the wrapper header files.

 make CUDA_PATH=/usr/local/cuda CUDNN_WRAP_PATH=/usr/local/include/

cudnnFindPlan and cudnnGetPlan:

Prior to cuDNN V8, cuDNN provided the cudnnFindConvolution* and cudnnGetConvolution* functions, which offered a way to sample all the algorithms for a given problem and study their run times; the best algorithm for a given problem could then be cached. In cuDNN V8, these functions have been replaced with cudnnFindPlan and cudnnGetPlan.

To use cudnnFindPlan, a user needs to provide:

  • A source for a pruned list of engineConfigs for the given problem statement
  • A filter function to filter out execution plans based on prerequisite conditions

cudnnFindPlan in turn:

  • Creates the set of execution plans that are supported
  • Executes each filtered plan and ranks them in order of runtime

The most common engineConfig source is the built-in heuristics of cuDNN V8, generally appended with the fallback list. An example of usage can be seen in the run_from_cudnn_find(...) function in conv_sample.cpp; a hedged sketch of the same pattern follows.
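
A minimal sketch of the heuristics-plus-fallback pattern, using the get_heuristics_list(...) helper described in the release notes below. Here opGraph is assumed to be an already-built operation graph, and the mode strings follow the ones documented there:

    // Collect engine configs from the heuristics, appended with the fallback
    // list, and prune them with a filter function.
    cudnn_frontend::EngineConfigList filtered_configs;
    auto filter_fn = [](cudnnBackendDescriptor_t) -> bool {
        return false;  // returning false blocks nothing; a real filter would
                       // reject configs with unwanted numerical/behavior notes
    };
    cudnn_frontend::get_heuristics_list<2>(
        {"heuristics_instant", "heuristic_fallback"}, opGraph, filter_fn,
        filtered_configs, /*evaluate_all=*/true);
    // cudnnFindPlan then finalizes each surviving config into an execution
    // plan, times it, and returns the plans sorted by runtime.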

Documentation

Documentation can be found at https://nvidia.github.io/cudnn-frontend/

Feedback

Support, resources, and information about cuDNN can be found online at https://developer.nvidia.com/cudnn.

For questions or to provide feedback, please contact [email protected].

Comments
  • Error: CUDNN_STATUS_EXECUTION_FAILED

    Hi,

    I’m currently using cuDNN to write a deep learning super-resolution sample, but the function cudnnConvolutionBiasActivationForward returns the status CUDNN_STATUS_EXECUTION_FAILED. I checked the API Reference and found only the description "The function failed to launch on the GPU." for this status.

    I also found a similar post in the NVIDIA forum, but it has no replies. I noticed that GPU memory usage rises rapidly when running the sample, but I can't debug further without more specific diagnostic information. Could you give any suggestions for this issue?

    Thanks a lot in advance!

    opened by akineeic 8
  • Lack of activation function LeakyReLU

    Hi,

    I’m currently using cuDNN to write a deep learning super-resolution sample, but I found that support for LeakyReLU is not mentioned in the documentation, even though LeakyReLU is used in my pre-trained model. Could you please give any suggestions or sample code to solve this problem?

    Thanks a lot in advance!

    opened by akineeic 6
  • Cannot build nvidia-tensorflow with v0.5

    A Dockerfile to reproduce:

    FROM nvidia/cuda:11.4.2-cudnn8-devel-ubuntu20.04
    ENV TZ=Europe/London
    RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ >/etc/timezone
    RUN apt-get update && apt-get -y upgrade
    RUN apt-get install -y build-essential git git-lfs wget vim software-properties-common unzip python3-pip
    RUN update-alternatives --install /usr/bin/python python $(which python3) 10
    RUN pip install --upgrade numpy astor
    WORKDIR /workdir
    RUN wget https://github.com/bazelbuild/bazel/releases/download/0.26.1/bazel-0.26.1-installer-linux-x86_64.sh
    RUN chmod +x bazel-0.26.1-installer-linux-x86_64.sh && ./bazel-0.26.1-installer-linux-x86_64.sh
    
    RUN git clone https://github.com/NVIDIA/cudnn-frontend.git
    RUN git clone --branch r1.15.5+nv21.10 --single-branch https://github.com/NVIDIA/tensorflow.git
    WORKDIR /workdir/tensorflow
    ENV TF_ENABLE_XLA=0 \
        TF_NEED_OPENCL_SYCL=0 \
        TF_NEED_ROCM=0 \
        TF_NEED_CUDA=1 \
        TF_NEED_TENSORRT=0 \
        TF_CUDA_VERSION=11 \
        TF_CUBLAS_VERSION=11 \
        TF_NCCL_VERSION=2 \
        TF_CUDNN_VERSION=8 \
        TF_CUDA_PATHS="/usr/include,/usr/lib/x86_64-linux-gnu,/usr/local/cuda/include,/usr/local/cuda/lib64,/usr/local/cuda/bin,/usr/local/cuda" \
        TF_CUDA_COMPUTE_CAPABILITIES=3.5,5.0,5.2,6.1,7.0,7.5,8.6 \
        CC_OPT_FLAGS="-march=sandybridge -mfma -mfpmath=both -fopenmp"
    RUN PYTHON_BIN_PATH=$(which python) ./configure
    RUN bazel build --config=opt --config=noaws --config=nogcp --config=nohdfs --config=noignite --config=nokafka //tensorflow/tools/pip_package:build_pip_package
    

    Error message:

    ERROR: /workdir/tensorflow/tensorflow/stream_executor/cuda/BUILD:343:1: C++ compilation of rule '//tensorflow/stream_executor/cuda:cudnn_plugin' failed (Exit 1)
    In file included from bazel-out/host/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend.h:116,
                     from tensorflow/stream_executor/cuda/cuda_dnn.cc:50:
    bazel-out/host/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_ExecutionPlanCache.h:29:10: fatal error: cudnn_frontend_OperationGraph.h: No such file or directory
       29 | #include <cudnn_frontend_OperationGraph.h>
          |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    Target //tensorflow/tools/pip_package:build_pip_package failed to build
    

    Using git clone --branch v0.4.1 --single-branch https://github.com/NVIDIA/cudnn-frontend.git instead seems to work.

    opened by ziyuang 5
  • Support for half2 type convolution

    Hi, I'm looking at mixed-precision programming: https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/. Does cudnn frontend have any support for half2 type convolution?

    opened by infloop777 5
  • How to use cudnn backend API to do int8x32 convolution calculation on Ampere?

    Could you give samples of int8x32 convolution using the cuDNN backend API? 1. How to create xTensor, wTensor, and so on? 2. How to create the conv_op node? 3. What else needs to be created?

    opened by ZhaoJob 5
  • Add -Wnon-virtual-dtor

    This resolves the downstream issue in pytorch when we try to enable this check. More information can be found at https://github.com/pytorch/pytorch/pull/81012

    opened by huydhn 4
  • how to map to the original algorithm

    Hello, thanks for your great work @YangXu1990uiuc.

    I am wondering if we can find a mapping between the execution engine and the original algorithms such as CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM, and CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM.

    opened by jackmsye 2
  • CUDNN not working with RTX A4000

    Hi there,

    I'm not sure where to open this issue, so I'm dropping it here since I cannot find the appropriate channel.

    I have been trying to crunch some numbers with several Python packages, but none of them works with the RTX A4000. After debugging for days, I found the issue is not my environment but the GPU.

    More specifically, I have been trying to use the following packages: MXNet and CuPy.

    My first impression was that there was something wrong with my environment, so I then tried to just run a Docker container to test the GPU along with the MXNet package:

    docker pull mxnet/python:gpu

    Trying to run either mxnet or cupy hangs... it does not compute anything...

    Then I tried the default nvidia container:

    docker run --rm --gpus all nvidia/cuda nvidia-smi

    The output seems ok:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA RTX A4000    Off  | 00000000:01:00.0  On |                  Off |
    | 41%   49C    P2    37W / 140W |    787MiB / 16109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    

    It is very confusing because I can run PyTorch, TensorRT, etc. on this machine...

    Then I tried to run the installer again, following the NVIDIA guide for cuDNN:

    $ cp -r /usr/src/cudnn_samples_v8/ $HOME
    $ cd $HOME/cudnn_samples_v8/mnistCUDNN
    $ make clean && make
    $ time ./mnistCUDNN
    

    Then what I get is:

    Executing: mnistCUDNN
    cudnnGetVersion() : 8300 , CUDNN_VERSION from cudnn.h : 8300 (8.3.0)
    Host compiler version : GCC 7.5.0
    
    There are 1 CUDA capable devices on your machine :
    device 0 : sms 48  Capabilities 8.6, SmClock 1560.0 Mhz, MemSize (Mb) 16109, MemClock 7001.0 Mhz, Ecc=0, boardGroupID=0
    Using device 0
    
    Testing single precision
    Loading binary file data/conv1.bin
    Loading binary file data/conv1.bias.bin
    Loading binary file data/conv2.bin
    Loading binary file data/conv2.bias.bin
    Loading binary file data/ip1.bin
    Loading binary file data/ip1.bias.bin
    Loading binary file data/ip2.bin
    Loading binary file data/ip2.bias.bin
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.017408 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.018432 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.027648 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.060416 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.096256 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 2057744 memory
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 2000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 128000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.037888 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.046080 time requiring 2000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.055296 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.057344 time requiring 128000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.078848 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 1433120 memory
    Resulting weights from Softmax:
    0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000 
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.016384 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.016384 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.017408 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.055296 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.060416 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 2057744 memory
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 2000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 128000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.037888 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.041984 time requiring 2000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.050176 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.063488 time requiring 128000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.076800 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 1433120 memory
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000 
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006 
    
    Result of classification: 1 3 5
    
    Test passed!
    
    Testing half precision (math in single precision)
    Loading binary file data/conv1.bin
    Loading binary file data/conv1.bias.bin
    Loading binary file data/conv2.bin
    Loading binary file data/conv2.bias.bin
    Loading binary file data/ip1.bin
    Loading binary file data/ip1.bias.bin
    Loading binary file data/ip2.bin
    Loading binary file data/ip2.bias.bin
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 28800 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.009216 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.031744 time requiring 28800 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.061440 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.187392 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.235520 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 2057744 memory
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 64000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.066560 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.080864 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.116736 time requiring 64000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.120832 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.121856 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 1433120 memory
    Resulting weights from Softmax:
    0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001 
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 28800 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.024576 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.029696 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.041984 time requiring 28800 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.076800 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.078848 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 2057744 memory
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 64000 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.037888 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.057344 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.091136 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.106496 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.108544 time requiring 64000 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 1433120 memory
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000 
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006 
    
    Result of classification: 1 3 5
    
    Test passed!
    
    real	7m18.842s
    user	7m17.724s
    sys	0m0.516s
    
    

    This takes a LONG time to execute.

    Then, for a final test, I just swapped the RTX A4000 for an old NVIDIA Tesla K80... Suddenly everything works like a charm...

    I'm almost sure there is something wrong with either the driver or the GPU firmware. I tried my code in Google Colab, on the old K80, and even on a Jetson AGX, and it works, but not with the RTX A4000.

    The system information output from CuPy (Python), which also takes a long time, is:

    import cupy as cp
    print(cp.show_config())
    
    OS                           : Linux-5.4.0-42-generic-x86_64-with-Ubuntu-18.04-bionic
    Python Version               : 3.6.9
    CuPy Version                 : 9.6.0
    CuPy Platform                : NVIDIA CUDA
    NumPy Version                : 1.19.5
    SciPy Version                : 1.5.4
    Cython Build Version         : 0.29.22
    Cython Runtime Version       : 0.29.22
    CUDA Root                    : /usr/local/cuda
    nvcc PATH                    : /usr/local/cuda/bin/nvcc
    CUDA Build Version           : 10020
    CUDA Driver Version          : 11050
    CUDA Runtime Version         : 10020
    cuBLAS Version               : (available)
    cuFFT Version                : 10102
    cuRAND Version               : 10102
    cuSOLVER Version             : (10, 3, 0)
    cuSPARSE Version             : (available)
    NVRTC Version                : (10, 2)
    Thrust Version               : 100907
    CUB Build Version            : 100800
    Jitify Build Version         : 60e9e72
    cuDNN Build Version          : 8204
    cuDNN Version                : 8300
    NCCL Build Version           : 21104
    NCCL Runtime Version         : 21104
    cuTENSOR Version             : None
    cuSPARSELt Build Version     : None
    Device 0 Name                : NVIDIA RTX A4000
    Device 0 Compute Capability  : 86
    Device 0 PCI Bus ID          : 0000:01:00.0
    None
    

    Apologies if this is not the right place for this issue, I really appreciate your help.

    opened by vsantosu 2
  • need default return value for cudnn_frontend::PointWiseDesc_v8::getPortCount() const

    Hi all,

    When building PyTorch, warnings are treated as errors and the build fails here:

    In file included from /opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:36,
                     from /opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend_OperationGraph.h:36,
                     from /opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend_Heuristics.h:30,
                     from /opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend.h:101,
                     from /opt/pytorch/pytorch/aten/src/ATen/native/cudnn/Conv_v8.cpp:10:
    /opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h: In member function ‘int64_t cudnn_frontend::PointWiseDesc_v8::getPortCount() const’:
    /opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h:120:5: error: control reaches end of non-void function [-Werror=return-type]
      120 |     }
    

    I think a default value should be returned to fix this.
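
    For illustration, a hypothetical reduction of the warning and the suggested fix (not the library's actual code):

      #include <cstdint>

      enum class PointwiseMode { ADD, RELU_FWD };

      // -Werror=return-type fires because control can fall off the end of a
      // non-void function, e.g. a switch with no default return path:
      int64_t getPortCount(PointwiseMode mode) {
          switch (mode) {
              case PointwiseMode::ADD:      return 3;
              case PointwiseMode::RELU_FWD: return 2;
          }
          return -1;  // default value so every control path returns
      }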

    opened by azrael417 1
  • V0.7.3 rc

    • [Enhancement] Added a CUDNN_FRONTEND_VERSION macro to cudnn_frontend.
    • [Enhancement] Added the inline keyword to the get_plan functions to enable inclusion in multiple compilation units.
    • [Bug fix] Replaced CUDNN with CUDNN_VERSION as the correct macro name.

    opened by Anerudhan 0
  • Is dgrad+relu with fp32 supported?

    I have tried to fuse dgrad with some pointwise operations like ReLU, but cudnnBackendFinalize(plan) returns the error CUDNN_STATUS_NOT_SUPPORTED.

    I am not sure whether my code is in error. I want to know: is dgrad + pointwise supported in cuDNN 8.6 on the Turing architecture? Could you give me a sample of dgrad fusion? There is only one dgrad case in this project, and it only supports the Ampere architecture.

    opened by tjich 1
  • Execute matmul op failed

    I tried to run the matmul op, following the code in the function "void run_matmul_bias_gelu" in fusion_sample.cpp. My code is as below:

    int main() {
      auto ha = loadBinary<float>("/workspace/features.bin");
      auto hb = loadBinary<float>("/workspace/weight.bin");
      auto hc = loadBinary<float>("/workspace/output.bin");
    
      float *a_ptr;
      cudaMalloc((void **)&a_ptr, ha.size() * sizeof(float));
      cudaMemcpy(a_ptr, ha.data(), ha.size() * sizeof(float),
                 cudaMemcpyHostToDevice);
      float *b_ptr;
      cudaMalloc((void **)&b_ptr, hb.size() * sizeof(float));
      cudaMemcpy(b_ptr, hb.data(), hb.size() * sizeof(float),
                 cudaMemcpyHostToDevice);
    
      float *c_ptr;
      cudaMalloc((void **)&c_ptr, hc.size() * sizeof(float));
    
      const int m = ha.size() / 96;
      const int n = 96;
      const int k = 96;
    
      int64_t stride[3];
      int64_t a_dim[3] = {1, m, k};
      int64_t b_dim[3] = {1, k, n};
      int64_t c_dim[3] = {1, m, n};
      generateStrides(a_dim, stride, 3, CUDNN_TENSOR_NCHW);
      auto aMatrixTensor =
          cudnn_frontend::TensorBuilder()
              .setDim(3, a_dim)
              .setStride(3, stride)
              .setId('a')
              .setAlignment(
                  16)  // 16B alignment is needed to run a tensor core engine
              .setDataType(CUDNN_DATA_FLOAT)
              .build();
    
      generateStrides(b_dim, stride, 3, CUDNN_TENSOR_NCHW);
      auto bMatrixTensor = cudnn_frontend::TensorBuilder()
                               .setDim(3, b_dim)
                               .setStride(3, stride)
                               .setId('b')
                               .setAlignment(16)
                               .setDataType(CUDNN_DATA_FLOAT)
                               .build();
    
      generateStrides(c_dim, stride, 3, CUDNN_TENSOR_NCHW);
      auto afterMatMulTensor = cudnn_frontend::TensorBuilder()
                                   .setDim(3, c_dim)
                                   .setStride(3, stride)
                                   .setId('A')  // after matmul
                                   .setAlignment(16)
                                   .setVirtual()
                                   .setDataType(CUDNN_DATA_FLOAT)
                                   .build();
    
      std::cout << aMatrixTensor.describe() << std::endl;
      std::cout << bMatrixTensor.describe() << std::endl;
      std::cout << afterMatMulTensor.describe() << std::endl;
    
      // Define the matmul desc
      auto matmulDesc = cudnn_frontend::MatMulDescBuilder()
                            .setComputeType(CUDNN_DATA_FLOAT)
                            .build();
      std::cout << matmulDesc.describe() << std::endl;
    
      // Create a matmul Node
      auto matmul_op = cudnn_frontend::OperationBuilder(
                           CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR)
                           .setaMatDesc(aMatrixTensor)
                           .setbMatDesc(bMatrixTensor)
                           .setcMatDesc(afterMatMulTensor)
                           .setmatmulDesc(matmulDesc)
                           .build();
      std::cout << matmul_op.describe() << std::endl;
    
      std::array<cudnn_frontend::Operation const *, 1> ops = {&matmul_op};
      cudnnHandle_t handle_;
      checkCudnnErr(cudnnCreate(&handle_));
    
      auto opGraph = cudnn_frontend::OperationGraphBuilder()
                         .setHandle(handle_)
                         .setOperationGraph(ops.size(), ops.data())
                         .build();
      auto plan =
          get_execplan_from_heuristics_else_fall_back(std::move(opGraph), handle_);
    
      auto workspace_size = plan.getWorkspaceSize();
      std::cout << plan.describe() << " requires workspace " << workspace_size
                << std::endl;
    
      void *workspace_ptr = nullptr;
      if (workspace_size > 0) {
        cudaMalloc(&workspace_ptr, (size_t)workspace_size);
      }
    
      void *data_ptrs[] = {a_ptr, b_ptr, c_ptr};
      int64_t uids[] = {'a', 'b', 'c'};
      auto variantPack = cudnn_frontend::VariantPackBuilder()
                             .setWorkspacePointer(workspace_ptr)
                             .setDataPointers(3, data_ptrs)
                             .setUids(3, uids)
                             .build();
    
      cudnnStatus_t status = cudnnBackendExecute(handle_, plan.get_raw_desc(),
                                                 variantPack.get_raw_desc());
    
      if (workspace_size > 0) {
        (cudaFree(workspace_ptr));
      }
      checkCudnnErr(cudnnDestroy(handle_));
      cudnn_frontend::throw_if(
          [status]() { return (status != CUDNN_STATUS_SUCCESS); },
          "Plan execute error", status);
    }
    

    I ran it under cuda-memcheck and got the logs below. Is there anything wrong with my code? I use the Docker image nvcr.io/nvidia/pytorch:21.10-py3 and my driver version is 515.65.01.

    root@3c261e44c675:/workspace/trt_inference/bin# cuda-memcheck ./trtexc
    ========= CUDA-MEMCHECK
    ha size is 5164704 hb size is 9216 hc size is 5164704
    CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: CUDNN_DATA_FLOAT Id: 97 Alignment: 16 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 1,53799,96 ] Str [,5164704,96,1] isVirtual: 0 isByValue: 0
    CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: CUDNN_DATA_FLOAT Id: 98 Alignment: 16 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 1,96,96 ] Str [,9216,96,1] isVirtual: 0 isByValue: 0
    CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: CUDNN_DATA_FLOAT Id: 65 Alignment: 16 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 1,53799,96 ] Str [,5164704,96,1] isVirtual: 1 isByValue: 0
    CUDNN_BACKEND_MATMUL_DESCRIPTOR : Math precision 0
    CUDNN_BACKEND_OPERATION : OpMode: 19 X 0 Y 0 W 0 B 0 T 0 DW 0 DY 0 DX 0 C 0 A Mtrix 0x557468e919e0 B Mtrix 0x5574c343d5d0 C Mtrix 0x5574c35f37b0 P 0 MatMul 0x5574c35f39c0 Reduction 0 alphabetaType 4 Alpha: 1 1 Alpha2: 1 1 Beta: 0 0
    Heuristic has 3 configurations
    CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR : Matmul_eng0_k24=27, numeric_notes:[CUDNN_NUMERICAL_NOTE_TENSOR_CORE,] behavior_notes:[CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION,] workSpaceSize: 0
    requires workspace 0
    terminate called after throwing an instance of 'cudnn_frontend::cudnnException'
      what():  Plan execute error
    ========= Error: process didn't terminate successfully
    ========= No CUDA-MEMCHECK results found

    Hope to get help! Thanks.

    opened by Gebixiaochen 6
  • INT8 sample didn't work?

    I tried CUDA 11.2 and 11.4 with cuDNN 8.1, 8.2, and 8.4. The sample ConvScaleBiasAct_int8 didn't work with any combination of these environments, and it returns the error: "[ERROR] Exception CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED". My device is an A100; is there anything wrong?

    opened by vincentccc 0
  • Many samples don't work for me

    I'm familiar with how to use CUDA and, more specifically, cuDNN, and I'm now trying to get started with the cuDNN backend API by experimenting with the samples in this repository. However, many of them don't seem to work for me.

    My environment

    • Windows 11 version 21H2 (OS Build 22000.613)
    • CUDA 11.5, V11.5.119
    • cuDNN 8.3.3
    • Visual Studio 2019 version 16.0
    • (the machine I'm compiling this on only has a GTX 1060 MaxQ, but I don't think that's very relevant yet)

    General errors

    • ConvBiasAct: Setting a ReLU clip slope doesn't work; the result is always [ERROR] Exception CUDNN_BACKEND_POINTWISE_DESCRIPTOR: SetAttribute CUDNN_ATTR_POINTWISE_RELU_LOWER_CLIP_SLOPE, Failed cudnn_status: CUDNN_STATUS_BAD_PARAM

    • ConvBiasAct: After commenting out the previous code, the next issue is [ERROR] Exception CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: GetAttribute CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE Failed cudnn_status: CUDNN_STATUS_BAD_PARAM. This code is gated behind a cuDNN version check, but my version is high enough, so I don't know why this fails. After commenting out this and the previous issue, this sample runs fine.

    • ConvBiasScaleAct, ConvBiasScaleAct_int8, ConvScaleBiasAddAct sample, and ConvScaleBiasAddAct sample_float all fail with the exact same [ERROR] Exception CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR: cudnn Finalize failed cudnn_status: CUDNN_STATUS_BAD_PARAM. No idea what could cause this.

    • Multihead attention sample, MatmulBiasAct sample, and MatmulBiasAct sample_float all fail with [ERROR] Exception CUDNN_BACKEND_MATMUL_DESCRIPTOR: cudnnCreate Failed cudnn_status: CUDNN_STATUS_ALLOC_FAILED. I'm not sure what could cause this; I have plenty of free CPU and GPU memory.

    • ConvDrelu and DgradDrelu fail with [ERROR] Exception CUDNN_BACKEND_OPERATION: SetAttribute CUDNN_ATTR_OPERATION_POINTWISE_DYDESC Failed cudnn_status: CUDNN_STATUS_BAD_PARAM. Maybe pointwise derivatives aren't properly supported yet? There is no documentation in the API reference for this attribute name yet.

    Hard-crash in BN Finalize:

    • Creation of the third tensor here fails: https://github.com/NVIDIA/cudnn-frontend/blob/43709ab96c47e26eebcdac72f93f946d44ceffa8/samples/fusion_sample.cpp#L2370-L2372. I assume this is because there is some issue with the CUDNN_DATA_INT64 datatype? Specifically, the cudnnBackendFinalize of the tensor descriptor crashes internally (it's not just a non-success return value; the call itself actually generates an exception).

    Conclusion

    In conclusion, everything seems to be very brittle at the moment. Is this because I am using some wrong versions? Or is the cuDNN backend/frontend API still too much of a work in progress for end users? Or is Windows support still lacking?

    opened by KarelPeeters 4
  • Inference result of deep learning model is all NAN

    Hi,

    I’m currently using cuDNN to write a deep learning super-resolution sample, but I found that the inference result of the model is all NaN. I then printed the output of the middle layers and found that the output grows after each block, and the data exceeds the representable range after more than 20 blocks, as shown in the figure below.

    (screenshot of intermediate layer outputs omitted)

    I compared it with the Python version and confirmed that the input image and the weights of the convolutional layers are consistent, so I think I made some mistake while building the model with the cuDNN API. Could you please take a look at my code to see if I made any obvious mistakes and give me some debugging suggestions?

    Thanks a lot in advance!

    opened by akineeic 2
Releases (v0.7.3)
  • v0.7.3(Oct 28, 2022)

    v0.7.3 Release Notes:

    • [Enhancement] Added a CUDNN_FRONTEND_VERSION macro to cudnn_frontend.
    • [Enhancement] Added the inline keyword to the get_plan functions to enable inclusion in multiple compilation units (see the sketch below).
    • [Bug fix] Replaced CUDNN with CUDNN_VERSION as the correct macro name.
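
    The inline change matters for a header-only library. A hypothetical reduction of the problem it fixes (not the library's actual code):

      // util.h, included by several .cpp files. Without `inline`, each
      // translation unit emits its own definition and the link step fails
      // with a multiple-definition error; `inline` permits identical
      // definitions across units.
      inline int get_plan_count() { return 1; }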

    Source code(tar.gz)
    Source code(zip)
  • v0.7.2(Oct 20, 2022)

    Release Notes:

    cudnn_frontend v0.7 aims to target the new features introduced in cudnn version v8.5 (https://developer.nvidia.com/cudnn). The following are the changes in the v0.7 release.

    [New API] Added support for Resample operation.

    [New API] Tensor class has a clone method which allows a user to quickly create a new Tensor object with similar attributes.

    [New API] Added support for new pointwise operations CUDNN_POINTWISE_ERF, CUDNN_POINTWISE_GELU_APPROX_TANH_FWD, CUDNN_POINTWISE_GELU_APPROX_TANH_BWD, CUDNN_POINTWISE_IDENTITY.

    [New API] Several API names have been unified and made consistent across multiple descriptors for readability.

    • setComputePrecision/setMathPrecision/setMathType have been unified into setComputeType in cudnn_frontend_ConvDesc.h, cudnn_frontend_MatMulDesc.h, cudnn_frontend_Operation.h, cudnn_frontend_PointWiseDesc.h, cudnn_frontend_ReductionDesc.h, cudnn_frontend_Resample.h
    • Math operations like ConvDesc, ResampleDesc have getSpatialDimCount instead of getDimCount to avoid confusion with Tensor Dimensions.
    • Accessors for arrays will have [g,s]et[Spatial]<AttributeName> as the API. [Spatial] is only needed when the attribute is common to both Tensor descriptor and Operation descriptor. Currently, it's only the Stride and DimCount attributes that have ambiguity.
      • setArray functions will take size and pointer as arguments, e.g. setStride(int dim, int64_t* arr), setSpatialStride(int dim, int64_t* arr)
      • getArray functions will return a pointer to the array whose size is determined by getDimCount or getSpatialDimCount

    [Minor Enhancement] Execution plans and Operation Graph print out more information in their describe() method.

    [Bug Fixes] Some samples have been updated to go over all fallback configs to ensure that a successful plan is built.

    [Bug Fixes] Execution plans had wrongly initialized numerical note CUDNN_NUMERICAL_NOTE_TYPE_TENSOR_CORE. This has been fixed.

    [Samples] Added a new sample that scales and biases two tensors, adds them, and follows with a ReLU operation, to show how fused operations work.

    [Samples] Added a sample to demonstrate how the resample operation works.

    [Samples] Added a new sample which shows convolution followed by multiple scales.

    [Samples] Added a sample to show Fully Connected Layer fused with GeLU forward.

    [Samples] Added a new sample to show fused backward activation, backward bias and backward Data Grad operation.

    The current FE is designed to be compatible with all minor releases of cuDNN 8.x.

    v0.7.1

    [Enhancement] Additional commit to remove an extraneous include of cudnn_ops_infer.h.

    v0.7.2

    [Enhancement] Fixed issues in the code which caused warnings in MSVC and clang compilers.

    [Enhancement] Fixed errors in get_heuristics_list where, for certain heuristics modes in older cuDNN versions, the returned heuristics list could be incorrect.

    [Bug fixes] Fixed several test cases failing on unsupported GPUs to exit gracefully.

    [Samples] Added a sample to showcase fp8 convolution forward on NVIDIA Hopper GPUs. The sample also showcases post-convolution book-keeping operations such as scaling and absolute-maximum reduction.

    [Samples] Added a sample which converts an fp16 tensor to fp8 and performs transpose and absolute-maximum reduction.

    [Samples] Added a sample to demonstrate the max pooling operation, including the tensor index dump necessary to speed up the backward pass.

    [Samples] Added a sample to showcase the backward pooling operation.

    Source code(tar.gz)
    Source code(zip)
  • v0.7.1(Aug 27, 2022)

    Release Notes:

    cudnn_frontend v0.7 aims to target the new features introduced in cudnn version v8.5 (https://developer.nvidia.com/cudnn). The following are the changes in the v0.7 release.

    [New API] Added support for Resample operation.

    [New API] Tensor class has a clone method which allows a user to quickly create a new Tensor object with similar attributes.

    [New API] Added support for new pointwise operations CUDNN_POINTWISE_ERF, CUDNN_POINTWISE_GELU_APPROX_TANH_FWD, CUDNN_POINTWISE_GELU_APPROX_TANH_BWD, CUDNN_POINTWISE_IDENTITY.

    [New API] Several API names have been unified and made consistent across multiple descriptors for readability.

    • setComputePrecision/setMathPrecision/setMathType have been unified into setComputeType in cudnn_frontend_ConvDesc.h, cudnn_frontend_MatMulDesc.h, cudnn_frontend_Operation.h, cudnn_frontend_PointWiseDesc.h, cudnn_frontend_ReductionDesc.h, cudnn_frontend_Resample.h
    • Math operations like ConvDesc, ResampleDesc have getSpatialDimCount instead of getDimCount to avoid confusion with Tensor Dimensions.
    • Accessors for arrays will have [g,s]et[Spatial]<AttributeName> as the API. [Spatial] is only needed when the attribute is common to both Tensor descriptor and Operation descriptor. Currently, it's only the Stride and DimCount attributes that have ambiguity.
      • setArray functions will take size and pointer as arguments eg. setStride(int dim, int64_t* arr), setSpatialStride(int dim, int64_t* arr)
      • getArray functions will return a pointer to the array whose size is determined by getDimCount or getSpatialDimCount
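
    A short sketch of the renamed accessors on a 2-D convolution descriptor (the attribute values are illustrative; the builder calls follow the v0.7 naming):

      int64_t pad[2]      = {1, 1};
      int64_t stride[2]   = {1, 1};
      int64_t dilation[2] = {1, 1};
      auto convDesc = cudnn_frontend::ConvDescBuilder()
                          .setComputeType(CUDNN_DATA_FLOAT)  // formerly setComputePrecision/setMathPrecision
                          .setMathMode(CUDNN_CROSS_CORRELATION)
                          .setSpatialDimCount(2)             // spatial dims, distinct from tensor dims
                          .setSpatialStride(2, stride)       // size + pointer, per the setArray convention
                          .setPrePadding(2, pad)
                          .setPostPadding(2, pad)
                          .setDilation(2, dilation)
                          .build();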

    [Minor Enhancement] Execution plans and Operation Graph print out more information in their describe() method.

    [Bug Fixes] Some samples have been updated to go over all fallback configs to ensure that a successful plan is built.

    [Bug Fixes] Execution plans had wrongly initialized numerical note CUDNN_NUMERICAL_NOTE_TYPE_TENSOR_CORE. This has been fixed.

    [Samples] Added a new sample that scales and biases two tensors, adds them, and follows with a ReLU operation, to show how fused operations work.

    [Samples] Added a sample to demonstrate how the resample operation works.

    [Samples] Added a new sample which shows convolution followed by multiple scales.

    [Samples] Added a sample to show Fully Connected Layer fused with GeLU forward.

    [Samples] Added a new sample to show fused backward activation, backward bias and backward Data Grad operation.

    The current FE is designed to be compatible with all minor releases of cuDNN 8.x.

    v0.7.1

    [Enhancement] Additional commit to remove an extraneous include of cudnn_ops_infer.h.

    Source code(tar.gz)
    Source code(zip)
  • v0.7(Aug 25, 2022)

    Release Notes:

    cudnn_frontend v0.7 aims to target the new features introduced in cudnn version v8.5 (https://developer.nvidia.com/cudnn). The following are the changes in the v0.7 release.

    [New API] Added support for Resample operation.

    [New API] Tensor class has a clone method which allows a user to quickly create a new Tensor object with similar attributes.

    [New API] Added support for new pointwise operations CUDNN_POINTWISE_ERF, CUDNN_POINTWISE_GELU_APPROX_TANH_FWD, CUDNN_POINTWISE_GELU_APPROX_TANH_BWD, CUDNN_POINTWISE_IDENTITY.

    [New API] Several API names have been unified and made consistent across multiple descriptors for readability.

    • setComputePrecision/setMathPrecision/setMathType have been unified into setComputeType in cudnn_frontend_ConvDesc.h, cudnn_frontend_MatMulDesc.h, cudnn_frontend_Operation.h, cudnn_frontend_PointWiseDesc.h, cudnn_frontend_ReductionDesc.h, cudnn_frontend_Resample.h
    • Math operations like ConvDesc, ResampleDesc have getSpatialDimCount instead of getDimCount to avoid confusion with Tensor Dimensions.
    • Accessors for arrays will have [g,s]et[Spatial]<AttributeName> as the API. [Spatial] is only needed when the attribute is common to both Tensor descriptor and Operation descriptor. Currently, it's only the Stride and DimCount attributes that have ambiguity.
      • setArray functions will take size and pointer as arguments eg. setStride(int dim, int64_t* arr), setSpatialStride(int dim, int64_t* arr)
      • getArray functions will return a pointer to the array whose size is determined by getDimCount or getSpatialDimCount

    [Minor Enhancement] Execution plans and Operation Graph print out more information in their describe() method.

    [Bug Fixes] Some samples have been updated to go over all fallback configs to ensure that a successful plan is built.

    [Bug Fixes] Execution plans had wrongly initialized numerical note CUDNN_NUMERICAL_NOTE_TYPE_TENSOR_CORE. This has been fixed.

    [Samples] Added a new sample that scales and biases two tensors, adds them, and follows with a ReLU operation, to show how fused operations work.

    [Samples] Added a sample to demonstrate how the resample operation works.

    [Samples] Added a new sample which shows convolution followed by multiple scales.

    [Samples] Added a sample to show Fully Connected Layer fused with GeLU forward.

    [Samples] Added a new sample to show fused backward activation, backward bias and backward Data Grad operation.

    Source code(tar.gz)
    Source code(zip)
  • v0.6.3(Jul 14, 2022)

    • [New Feature] Serialization:

      Execution Plan Serialization and Deserialization (Experimental)

      cuDNN v8.4 and above provides execution plan serialization and deserialization to save the execution plan as a string in JSON format. The execution plan can then be restored from that string at a later point, and this also saves compilation time compared to rebuilding the plan from scratch. Currently, this is an experimental feature that only supports the runtime fusion engine. No forward/backward or cross-device compatibility guarantee is offered at this time.

      API:

        - std::string cudnn_frontend::ExecutionPlan_v8::getJsonRepresentation() : Serialize the execution plan into a string in JSON format.
        - cudnn_frontend::ExecutionPlan_v8&& cudnn_frontend::ExecutionPlanBuilder_v8::loadFromJson(const std::string &json_plan) : Deserialize from a string containing the JSON representation of the execution plan.
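
      A minimal round-trip sketch (here plan is assumed to be a finalized execution plan and handle a valid cudnnHandle_t):

        std::string json_plan = plan.getJsonRepresentation();  // serialize, e.g. persist to disk

        auto restored = cudnn_frontend::ExecutionPlanBuilder()
                            .setHandle(handle)
                            .loadFromJson(json_plan);          // rebuild without recompiling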
      
    • [New API] Added a new API

      get_heuristics_list(std::array<std::string, SIZE> modes,
        OperationGraph_v8 &opGraph,
        std::function<bool(cudnnBackendDescriptor_t)> filter_fn,
        EngineConfigList &filtered_configs,
        bool evaluate_all = false)
      

      This function takes a parameter list of heuristics modes ("heuristics_instant", "heuristic_fallback", "heuristic_mode_b") and computes a list of engine configs which do not satisfy the blocking condition in filter_fn. The function can optionally be set to keep going even if one of the modes fails.

    • [New Features] Added support for BN Finalize i.e. generation of mean and variance to perform batch normalization.

    • [New Features] Added support for BN Stats fusion pattern. This pattern covers Scale, Bias, Relu, Conv and generation of SUM and SQSUM for batch normalization.

    • [New Features] Added support for CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations added in cuDNN 8.4.0.

    • [Cleanup] Fixed a bug where using CUDNN_HEUR_MODE_B from multiple threads could lead to a crash in certain conditions.

    • [Cleanup] Added CUDNN_PATH in CMakeLists.txt, allowing users to build with a different cuDNN installation path.

    • [Cleanup] Made the Engine_v8 constructor default, which prevents overwriting of the status during knob creation.

    • [Cleanup] Take the UIDs of the variant pack as a const pointer.

    • [Cleanup] When logging was enabled and no plan returned by heuristics was finalizable, it led to a crash. This is now fixed.

    • [Samples] Added a new sample to showcase CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations.

    • [Samples] Modified the MHA sample to show improved numerical stability. Investigation is ongoing to further improve the MHA sample.

    • [Samples] Added samples for fused operation graph for BN Stats generation and stats finalization.

    • Added missing return statements for operations.

    • Added warn-as-error to the samples Makefile.

    • Addressed multiple compiler warnings triggered by clang:

      • Unused variables.
      • Undefined destructors for classes with virtual methods.
    • During the heuristics query, if heur_mode_b fails, it falls back to heur_mode_a (heur_mode_instant).

    • Addressed a bug by initializing the numerical notes and behavior notes to max values instead of 0.

    Source code(tar.gz)
    Source code(zip)
  • v0.6.2(Apr 21, 2022)

    • [New Feature] Serialization:

      Execution Plan Serialization and Deserialization (Experimental)

      cuDNN v8.4 and above provides execution plan serialization and deserialization to save the execution plan as a string in JSON format. The execution plan can then be restored from that string at a later point, and this also saves compilation time compared to rebuilding the plan from scratch. Currently, this is an experimental feature that only supports the runtime fusion engine. No forward/backward or cross-device compatibility guarantee is offered at this time.

      API:

        - std::string cudnn_frontend::ExecutionPlan_v8::getJsonRepresentation() : Serialize the execution plan into a string in JSON format.
        - cudnn_frontend::ExecutionPlan_v8&& cudnn_frontend::ExecutionPlanBuilder_v8::loadFromJson(const std::string &json_plan) : Deserialize from a string containing the JSON representation of the execution plan.
      
    • [New API] Added a new API

      get_heuristics_list(std::array<std::string, SIZE> modes,
        OperationGraph_v8 &opGraph,
        std::function<bool(cudnnBackendDescriptor_t)> filter_fn,
        EngineConfigList &filtered_configs,
        bool evaluate_all = false)
      

      This function takes a parameter list of heuristics modes ("heuristics_instant", "heuristic_fallback", "heuristic_mode_b") and computes a list of engine configs which do not satisfy the blocking condition in filter_fn. The function can optionally be set to keep going even if one of the modes fails.

    • [New Features] Added support for BN Finalize i.e. generation of mean and variance to perform batch normalization.

    • [New Features] Added support for BN Stats fusion pattern. This pattern covers Scale, Bias, Relu, Conv and generation of SUM and SQSUM for batch normalization.

    • [New Features] Added support for CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations added in cuDNN 8.4.0.

    • [Cleanup] Fixed a bug where using CUDNN_HEUR_MODE_B from multiple threads could lead to a crash in certain conditions.

    • [Cleanup] Added CUDNN_PATH in CMakeLists.txt, allowing users to build with a different cuDNN installation path.

    • [Cleanup] Made the Engine_v8 constructor default, which prevents overwriting of the status during knob creation.

    • [Cleanup] Take the UIDs of the variant pack as a const pointer.

    • [Cleanup] When logging was enabled and no plan returned by heuristics was finalizable, it led to a crash. This is now fixed.

    • [Samples] Added a new sample to showcase CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations.

    • [Samples] Modified the MHA sample to show improved numerical stability. Investigation is ongoing to further improve the MHA sample.

    • [Samples] Added samples for fused operation graph for BN Stats generation and stats finalization.

    • Added missing return statements for operations.

    • Added warn-as-error to the samples Makefile.

    • Addressed multiple compiler warnings triggered by clang:

      • Unused variables.
      • Undefined destructors for classes with virtual methods.
    Source code(tar.gz)
    Source code(zip)
  • v0.6.1(Apr 9, 2022)

    cuDNN Frontend v0.6 release

    • [New Feature] Serialization: (https://github.com/NVIDIA/cudnn-frontend/pull/26)

    Execution Plan Serialization and Deserialization (Experimental)

    cuDNN v8.4 and above provides execution plan serialization and deserialization to save the execution plan as a string in JSON format. The execution plan can then be restored from that string at a later point, and this also saves compilation time compared to rebuilding the plan from scratch. Currently, this is an experimental feature that only supports the runtime fusion engine. No forward/backward or cross-device compatibility guarantee is offered at this time.

    API:
        - std::string cudnn_frontend::ExecutionPlan_v8::getJsonRepresentation() : Serialize the execution plan into a string in JSON format.
        - cudnn_frontend::ExecutionPlan_v8&& cudnn_frontend::ExecutionPlanBuilder_v8::loadFromJson(const std::string &json_plan) : Deserialize from a string containing the JSON representation of the execution plan.
    
    • [New API] Added a new API

      get_heuristics_list(std::array<std::string, SIZE> modes,
        OperationGraph_v8 &opGraph,
        std::function<bool(cudnnBackendDescriptor_t)> filter_fn,
        EngineConfigList &filtered_configs,
        bool evaluate_all = false)
      

      This function takes a paramter list of heuristics mode. "heuristics_instant", "heuristic_fallback", "heuristic_mode_b" and computes a list of engine config which do not satisfy the blocking condition in filter_fn. The function can be optionally set to keep going even if one of the mode fails.

    • [New Features] Added support for BN Finalize i.e. generation of mean and variance to perform batch normalization.

    • [New Features] Added support for BN Stats fusion pattern. This pattern covers Scale, Bias, Relu, Conv and generation of SUM and SQSUM for batch normalization.

    • [New Features] Added support for CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations added in cuDNN 8.4.0.

    • [Cleanup] Fixed a bug when used CUDNN_HEUR_MODE_B is used in multiple threads leads to crash in certain conditions.

    • [Cleanup] Added the CUDNN_PATH in CMakeLists.txt allowing user to build with different cuDNN installation path.

    • [Cleanup] Made Engine_v8 constructor as default which prevents overwriting of the status during knob creation.

    • [Cleanup] Take UIDs of variant pack as a const pointer.

    • [Cleanup] When logging was enabled and if no plan returned by heuristics is finalizable, it lead to a crash. This is now fixed.

    • [Samples] Added a new sample to showcase CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations.

    • [Samples] Modified MHA sample to show improved numerical stability. Investigation is still going on to further improve the MHA sample

    • [Samples] Added samples for fused operation graph for BN Stats generation and stats finalization.

    v0.6.1

    • [Cleanup] Patch a fix for compilation errors in cuDNN v8.3 and below
    Source code(tar.gz)
    Source code(zip)
  • v0.6(Apr 7, 2022)

    cuDNN Frontend v0.6 release

    • [New Feature] Serialization: (https://github.com/NVIDIA/cudnn-frontend/pull/26)

    Execution Plan Serialization and Deserialization (Experimental)

    cuDNN v8.4 and above provides exeuction plan serialization and deserialization to save the execution plan as a string in JSON format. The execution plan can be then restored from that string at a later point, and this also saves compilation time compared to rebuilding the plan from scratch. Currently, this is an experimental feature that only supports the runtime fusion engine. No forward/backward or cross-device compatibility guarantee is offered at this time.
    
    ### API:
        - std::string cudnn_frontend::ExecutionPlan_v8::getJsonRepresentation() : Serialize the execution plan into a string in JSON format.
        - cudnn_frontend::ExecutionPlan_v8&& cudnn_frontend::ExecutionPlanBuilder_v8::loadFromJson(const std::string &json_plan) : Deserialize from a string containing the JSON representation of the execution plan.
    
    • [New API] Added a new API

      get_heuristics_list(std::array<std::string, SIZE> modes,
        OperationGraph_v8 &opGraph,
        std::function<bool(cudnnBackendDescriptor_t)> filter_fn,
        EngineConfigList &filtered_configs,
        bool evaluate_all = false)
      

      This function takes a paramter list of heuristics mode. "heuristics_instant", "heuristic_fallback", "heuristic_mode_b" and computes a list of engine config which do not satisfy the blocking condition in filter_fn. The function can be optionally set to keep going even if one of the mode fails.

    • [New Features] Added support for BN Finalize i.e. generation of mean and variance to perform batch normalization.

    • [New Features] Added support for BN Stats fusion pattern. This pattern covers Scale, Bias, Relu, Conv and generation of SUM and SQSUM for batch normalization.

    • [New Features] Added support for CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations added in cuDNN 8.4.0.

    • [Cleanup] Fixed a bug when used CUDNN_HEUR_MODE_B is used in multiple threads leads to crash in certain conditions.

    • [Cleanup] Added the CUDNN_PATH in CMakeLists.txt allowing user to build with different cuDNN installation path.

    • [Cleanup] Made Engine_v8 constructor as default which prevents overwriting of the status during knob creation.

    • [Cleanup] Take UIDs of variant pack as a const pointer.

    • [Cleanup] When logging was enabled and if no plan returned by heuristics is finalizable, it lead to a crash. This is now fixed.

    • [Samples] Added a new sample to showcase CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations.

    • [Samples] Modified MHA sample to show improved numerical stability. Investigation is still going on to further improve the MHA sample

    • [Samples] Added samples for fused operation graph for BN Stats generation and stats finalization.

    Source code(tar.gz)
    Source code(zip)
  • v0.5.1(Jan 25, 2022)

    • Fix an issue where cuDNN Frontend API always used the default stream in cudnn_find_plan (autotuning). Now, the stream is queried from the handle.
    • Updated CMakelist.txt to depend on CUDNN_FRONTEND_PATH environment variable.
    • Fixed a compilation warnings for missing return values
    Source code(tar.gz)
    Source code(zip)
  • v0.5(Nov 12, 2021)

    Release Notes:

    - [New API]: Execution Plan Caching
       ## Execution Plan Caching
          cuDNN through heuristics provides a way to query a list of good engine configs. Based on this query we build the cudnn_frontend_find_plan function which runs all the engineConfig(s) on the given user system and returns a sorted list of plans. 
          This process of running multiple plans through several iterations is time consuming. The ExecutionPlanCache allows the user to build a cache with operation graph as the key to query an execution plan. It is the responsibilty of the user to maintain different ca  ches for different types of operation_graphs (For eg. different cache for convolutionForward compared to Dgrad or Wgrad).                                                                                                                                   
      
          ### API:
           - add_plan_to_cache(const cudnn_frontend::OperationGraph &op_graph, const cudnn_frontend::ExecutionPlan &plan) : Creates a mapping between the operation graph and executionPlan
           - bool get_plan_from_cache(const cudnn_frontend::OperationGraph &op_graph, const cudnn_frontend::ExecutionPlan *&plan) : Sets the executionPlan in the plan pointer and returns true if found. 
           - cudnnFindPlanAndCache(cudnnHandle_t handle, cudnn_frontend::OperationGraph &opGraph, cudnn_frontend::VariantPack const &variantPack, cudnn_frontend::ExecutionPlanCache &cache, Predicate pred) -> cudnn_frontend::ExecutionPlan : The above API chains the output of cudnn_frontend_find_plan and caches the result for future usage.
    
    - [New Feature]:  Allows logging in the cudnn frontend.
       ## Logging
          cuDNN Frontend API logging records execution flow through cuDNN frontend API. This functionality is disabled by default, and can be enabled through methods described in this section.
    
          ### Method 1: Using Environment Variables:
          | Environment variables                             | CUDNN_FRONTEND_LOG_INFO=0 | CUDNN_FRONTEND_LOG_INFO=1 |
          | --------------------------------------------------| ------------------------- | -----------               |
          | CUDNN_FRONTEND_LOG_FILE not set                   | No Logging                | No Logging                |
          | CUDNN_FRONTEND_LOG_FILE set to stdout or stderr   | No Logging                | Logging to cout or cerr   |
          | CUDNN_FRONTEND_LOG_FILE set to filename.txt       | No Logging                | Logging to the filename   |
    
          ### Method 2: Using API calls:
          Calling `cudnn_frontend::isLoggingEnabled() = true|false` has same effect of setting the environment variable.
          Calling `cudnn_frontend::getStream() = stream_name` can be used to assign the output stream directly.
    
    - [New API]: cudnnReorderFilterAndBiasInt8x32 :- Reorders the filter and bias tensors which allows the tensor cores to be used during Int8x32 convolutions
    - [New Feature]: Add support for isByValue attribute setting in tensor.
    
    - [Samples]: Clean up Makefile and move to cmake based setup. Allows samples to be compiled on Windows machines.
    - [Samples]: Updated samples to query the heuristics for fusion cases.
    - [Samples]: Added a new ConvScaleBiasAct_int8 sample to address https://github.com/NVIDIA/cudnn-frontend/issues/8
    - [Samples]: Added a sample to demonstrate how execution Plan caching works.
    - [Samples]: Added a new sample to show how Multi-Headed Attention can be implemented with run time fusion. Will work with 8.3.1
    - [Cleanup]: ExecutionPlan cache has copy contructor and pre-emptively caches the workspace, numerical and behavior notes.
    - [Cleanup]: Update cudnn_frontend_PointWiseDesc.h to include limits to fix gcc 11 compilation error.( https://github.com/NVIDIA/cudnn-frontend/pull/10)
    - [Cleanup]: Verify out of bounds iterator in Errata.h (https://github.com/NVIDIA/cudnn-frontend/pull/11)
    - [Cleanup]: Added default move assign and move constructor to all classes.
    - [Cleanup]: CheckcudaError and checkCudnnError correctly asserts now instead of silently failing.
    - [Cleanup]: Updated errata filter to no-longer block the engine ID 0 when running the Int8x32. 
    - [Cleanup]: Default value of knobs in engine config is not 0 anymore. 
    
    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Aug 13, 2021)

    [Bug Fix]: Fixed an issue where the vector count was not copied over during move construction phase. [Samples]: Added a new sample for INT8x32 config (utilizing integer tensor cores). The example includes an errata filter which blocks an engine that has a known issue running this config. [CleanUp]: Change all move constructors and fixed move assignment operator.

    Co-authored-by: agopal [email protected]

    Source code(tar.gz)
    Source code(zip)
  • v0.4(Jul 1, 2021)

    [New API] : Added a new function get_heuristics_list which accepts a list of heuristics mode and returns a concatenated list of the engine heuristics. [New Feature]: New mode of heuristic (HEUR_MODE_FALLBACK] added to the backend. Sample updated to use that and provides a generic way to access the fallback engines. FallbackEngineList is retained as a way to add custom engines in the frontend. [New Feature]: Added support to set vectorization dimension and vectorization count attributes in the tensor descriptor. [Rename]: setDataType in OperationBuilder deprecated and replaced with more clear setComputePrecision() [CleanUp] : cudnnFindPlan and cudnnGetPlan takes L-value operationGraph rather than previously R-value. [CleanUp] : cudnnFindPlan and time_sorted_plan return executionPlans_t (which is a vector plans) instead of executionOptions_t (which is a vector of struct containing plan and time). This is to achieve compatibility with the cudnnGet. [Samples]: New sample added for DP4A. [Samples]: ConvBiasScaleRelu sample| [Bug fix]: Errata filter was erroneously filtering out unspecified engines.

    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Jun 9, 2021)

    [Maintenance] Adding status check on the cudnnBackendExecute during warm up. [Maintenance] Adding status check on json_handle when loading from a file

    Source code(tar.gz)
    Source code(zip)
  • v0.3(May 17, 2021)

    [New feature] Support reduction operation in the frontend.
    [New feature] Add engine runtime compilation filter in the frontend as a behavior filter.
    [New feature] Adding fallback list for convBiasAct
    [New feature Beta] Adding Errata filter with an sample.
    [Samples] Add ConvBnstats and ConvColReduction tests
    [Bug Fix] Clamp upper_clip for float compute type to float max for pointwise descriptor when computeType is float.
    [Bug Fix] Compilation fix for newer gcc toolchain (gcc 9+).
    [Bug Fix] Add operation tag to the Plan generated by cudnnFind and cudnnGet
    [Maintenance] Added default fallback lists to newer versions of cudnn.
    
    Source code(tar.gz)
    Source code(zip)
Owner
NVIDIA Corporation
NVIDIA Corporation
Simple samples for TensorRT programming

Introduction This is a collection of simplified TensorRT samples to get you started with TensorRT programming. Most of the samples are written in C++,

NVIDIA Corporation 675 Jan 6, 2023
TensorFlow Lite, Coral Edge TPU samples (Python/C++, Raspberry Pi/Windows/Linux).

TensorFlow Lite, Coral Edge TPU samples (Python/C++, Raspberry Pi/Windows/Linux).

Nobuo Tsukamoto 87 Nov 16, 2022
AI-related samples made available by the DevTech ProViz team

ProViz-AI Samples This repository is a collection of AI-related samples, developed and provided by the DevTech ProViz team. Each folder in the reposit

NVIDIA Corporation 7 Nov 18, 2022
Implement yolov5 with Tensorrt C++ api, and integrate batchedNMSPlugin. A Python wrapper is also provided.

yolov5 Original codes from tensorrtx. I modified the yololayer and integrated batchedNMSPlugin. A yolov5s.wts is provided for fast demo. How to genera

weiwei zhou 46 Dec 6, 2022
DLPrimitives/OpenCL out of tree backend for pytorch

Pytorch OpenCL backend based on dlprimitives DLPrimitives-OpenCL out of tree backend for pytorch It is only beginning, but you can train some vision n

Artyom Beilis 89 Dec 27, 2022
copc-lib provides an easy-to-use interface for reading and creating Cloud Optimized Point Clouds

copc-lib copc-lib is a library which provides an easy-to-use reader and writer interface for COPC point clouds. This project provides a complete inter

Rock Robotic 25 Nov 29, 2022
International Business Machines 10 Dec 20, 2022
Cartographer is a system that provides real-time simultaneous localization and mapping (SLAM) in 2D and 3D across multiple platforms and sensor configurations.

Cartographer Purpose Cartographer is a system that provides real-time simultaneous localization and mapping (SLAM) in 2D and 3D across multiple platfo

Cartographer 6.3k Jan 4, 2023
Deep Learning in C Programming Language. Provides an easy way to create and train ANNs.

cDNN is a Deep Learning Library written in C Programming Language. cDNN provides functions that can be used to create Artificial Neural Networks (ANN)

Vishal R 12 Dec 24, 2022
PSTensor provides a way to hack the memory management of tensors in TensorFlow and PyTorch by defining your own C++ Tensor Class.

PSTensor : Custimized a Tensor Data Structure Compatible with PyTorch and TensorFlow. You may need this software in the following cases. Manage memory

Jiarui Fang 8 Feb 12, 2022
Mobile Robot Programming Toolkit (MRPT) provides C++ libraries aimed at researchers in mobile robotics and computer vision

The MRPT project 1. Introduction Mobile Robot Programming Toolkit (MRPT) provides C++ libraries aimed at researchers in mobile robotics and computer v

MRPT 1.6k Dec 24, 2022
Watertight Manifold Python Wrapper

Watertight Manifold Python Wrapper This repository is a simple PythonWrapper around the origin implementation of the paper: Huang, Jingwei, Hao Su, an

Photogrammetry & Robotics Bonn 20 Dec 27, 2022
A simple ros wrapper for apriltag-cpp

Ros wrapper for apriltags-cpp Ros wrapper of the APRIL tags library, using OpenCV (and optionally, CGAL). Requirements Currently, apriltags-cpp requir

Robot Perception & Navigation Group (RPNG) 6 Dec 30, 2021
ROS wrapper for real-time incremental event-based vision motion estimation by dispersion minimisation

event_emin_ros ROS wrapper for real-time incremental event-based vision motion estimation by dispersion minimisation (EventEMin). This code was used t

Imperial College London 2 Jan 10, 2022
Python wrapper for Environment Simulator Minimalistic (esmini)

python-esmini is a python wrapper for Environment Simulator Minimalistic (esmini). Install the package python-esmini is now only available for the Lin

Hamid Ebadi 6 Aug 4, 2022
DarkHelp - C++ wrapper library for Darknet

What is the DarkHelp C++ API? The DarkHelp C++ API is a wrapper to make it easier to use the Darknet neural network framework within a C++ application

Stéphane Charette 89 Dec 12, 2022
C++11 wrapper for the LMDB embedded B+ tree database library.

lmdb++: a C++11 wrapper for LMDB This is a comprehensive C++ wrapper for the LMDB embedded database library, offering both an error-checked procedural

D.R.Y. C++ 263 Dec 27, 2022
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large scales

Fairring (FAIR + Herring): a faster all-reduce TL;DR: Using a variation on Amazon’s "Herring" technique, which leverages reduction servers, we can per

Meta Research 46 Nov 24, 2022
Deep Learning API and Server in C++11 support for Caffe, Caffe2, PyTorch,TensorRT, Dlib, NCNN, Tensorflow, XGBoost and TSNE

Open Source Deep Learning Server & API DeepDetect (https://www.deepdetect.com/) is a machine learning API and server written in C++11. It makes state

JoliBrain 2.4k Dec 30, 2022