TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.

Overview

License Documentation

TensorRT Open Source Software

This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. Included are the sources for TensorRT plugins and parsers (Caffe and ONNX), as well as sample applications demonstrating usage and capabilities of the TensorRT platform. These open source software components are a subset of the TensorRT General Availability (GA) release with some extensions and bug-fixes.

Build

Prerequisites

To build the TensorRT-OSS components, you will first need the following software packages.

TensorRT GA build

System Packages

Optional Packages

Downloading TensorRT Build

  1. Download TensorRT OSS

    git clone -b master https://github.com/nvidia/TensorRT TensorRT
    cd TensorRT
    git submodule update --init --recursive
  2. (Optional - if not using TensorRT container) Specify the TensorRT GA release build path

    If using the TensorRT OSS build container, TensorRT libraries are preinstalled under /usr/lib/x86_64-linux-gnu and you may skip this step.

    Else download and extract the TensorRT GA build from NVIDIA Developer Zone.

    Example: Ubuntu 18.04 on x86-64 with cuda-11.4

    cd ~/Downloads
    tar -xvzf TensorRT-8.2.1.8.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
    export TRT_LIBPATH=`pwd`/TensorRT-8.2.1.8

    Example: Windows on x86-64 with cuda-11.4

    cd ~\Downloads
    Expand-Archive .\TensorRT-8.2.1.8.Windows10.x86_64.cuda-11.4.cudnn8.2.zip
    $Env:TRT_LIBPATH = "$(Get-Location)\TensorRT-8.2.1.8"
    $Env:PATH += ';C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\MSBuild\15.0\Bin\'
  3. (Optional - for Jetson builds only) Download the JetPack SDK

    1. Download and launch the JetPack SDK manager. Login with your NVIDIA developer account.
    2. Select the platform and target OS (example: Jetson AGX Xavier, Linux Jetpack 4.6), and click Continue.
    3. Under Download & Install Options change the download folder and select Download now, Install later. Agree to the license terms and click Continue.
    4. Move the extracted files into the docker/jetpack_files folder of the TensorRT OSS repository (see the example command after this list).
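
    For example, assuming the JetPack packages were downloaded to ~/Downloads/jetpack_downloads (a hypothetical folder; use whatever download folder you selected above) and $TRT_OSSPATH points to your TensorRT OSS checkout, the move might look like:

    mv ~/Downloads/jetpack_downloads/* $TRT_OSSPATH/docker/jetpack_files/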

Setting Up The Build Environment

For Linux platforms, we recommend that you generate a docker container for building TensorRT OSS as described below. For native builds, on Windows for example, please install the prerequisite System Packages.

  1. Generate the TensorRT-OSS build container.

    The TensorRT-OSS build container can be generated using the supplied Dockerfiles and build script. The build container is configured for building TensorRT OSS out-of-the-box.

    Example: Ubuntu 18.04 on x86-64 with cuda-11.4.2 (default)

    ./docker/build.sh --file docker/ubuntu-18.04.Dockerfile --tag tensorrt-ubuntu18.04-cuda11.4

    Example: CentOS/RedHat 7 on x86-64 with cuda-10.2

    ./docker/build.sh --file docker/centos-7.Dockerfile --tag tensorrt-centos7-cuda10.2 --cuda 10.2

    Example: Ubuntu 18.04 cross-compile for Jetson (aarch64) with cuda-10.2 (JetPack SDK)

    ./docker/build.sh --file docker/ubuntu-cross-aarch64.Dockerfile --tag tensorrt-jetpack-cuda10.2 --cuda 10.2

    Example: Ubuntu 20.04 on aarch64 with cuda-11.4.2

    ./docker/build.sh --file docker/ubuntu-20.04-aarch64.Dockerfile --tag tensorrt-aarch64-ubuntu20.04-cuda11.4
  2. Launch the TensorRT-OSS build container.

    Example: Ubuntu 18.04 build container

    ./docker/launch.sh --tag tensorrt-ubuntu18.04-cuda11.4 --gpus all

    NOTE:

    1. Use the --tag corresponding to the build container generated in Step 1.
    2. The NVIDIA Container Toolkit is required for GPU access (running TensorRT applications) inside the build container.
    3. The sudo password for Ubuntu build containers is 'nvidia'.
    4. Specify a port number using --jupyter to launch Jupyter notebooks, as in the example below.
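
    For instance, to also expose a Jupyter notebook from the build container, a launch command along these lines can be used (port 8888 is only an example):

    ./docker/launch.sh --tag tensorrt-ubuntu18.04-cuda11.4 --gpus all --jupyter 8888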

Building TensorRT-OSS

  • Generate Makefiles or VS project (Windows) and build.

    Example: Linux (x86-64) build with default cuda-11.4.2

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out
     make -j$(nproc)

    NOTE: On CentOS7, the default g++ version does not support C++14. For native builds (not using the CentOS7 build container), first install devtoolset-8 to obtain the updated g++ toolchain as follows:

    yum -y install centos-release-scl
    yum-config-manager --enable rhel-server-rhscl-7-rpms
    yum -y install devtoolset-8
    export PATH="/opt/rh/devtoolset-8/root/bin:${PATH}"

    Example: Linux (aarch64) build with default cuda-11.4.2

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
     make -j$(nproc)

    Example: Native build on Jetson (aarch64) with cuda-10.2

    cd $TRT_OSSPATH
    mkdir -p build && cd build
    cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DTRT_PLATFORM_ID=aarch64 -DCUDA_VERSION=10.2
    CC=/usr/bin/gcc make -j$(nproc)

    NOTE: C compiler must be explicitly specified via CC= for native aarch64 builds of protobuf.

    Example: Ubuntu 18.04 Cross-Compile for Jetson (aarch64) with cuda-10.2 (JetPack)

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=10.2 -DCUDNN_LIB=/pdk_files/cudnn/usr/lib/aarch64-linux-gnu/libcudnn.so -DCUBLAS_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublas.so -DCUBLASLT_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublasLt.so
     make -j$(nproc)

    NOTE: The latest JetPack SDK v4.6 only supports TensorRT 8.0.1.

    Example: Windows (x86-64) build in Powershell

     cd $Env:TRT_OSSPATH
     mkdir -p build ; cd build
     cmake .. -DTRT_LIB_DIR=$Env:TRT_LIBPATH -DTRT_OUT_DIR="$(Get-Location)\out" -DCMAKE_TOOLCHAIN_FILE=..\cmake\toolchains\cmake_x64_win.toolchain
     msbuild ALL_BUILD.vcxproj

    NOTE:

    1. The default CUDA version used by CMake is 11.4.2. To override this, for example to 10.2, append -DCUDA_VERSION=10.2 to the cmake command, as shown in the example after this list.
    2. If samples fail to link on CentOS7, create this symbolic link: ln -s $TRT_OUT_DIR/libnvinfer_plugin.so $TRT_OUT_DIR/libnvinfer_plugin.so.8
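
    For example, a Linux x86-64 configure step that overrides the default CUDA version (reusing the TRT_OSSPATH and TRT_LIBPATH variables from the examples above) might look like:

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCUDA_VERSION=10.2
     make -j$(nproc)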
  • Required CMake build arguments are:

    • TRT_LIB_DIR: Path to the TensorRT installation directory containing libraries.
    • TRT_OUT_DIR: Output directory where generated build artifacts will be copied.
  • Optional CMake build arguments (a combined example follows this list):

    • CMAKE_BUILD_TYPE: Specify whether the generated binaries are release or debug (containing debug symbols) builds. Values: [Release] | Debug
    • CUDA_VERSION: The version of CUDA to target, for example [11.4.2].
    • CUDNN_VERSION: The version of cuDNN to target, for example [8.2].
    • PROTOBUF_VERSION: The version of Protobuf to use, for example [3.0.0]. Note: Changing this will not configure CMake to use a system version of Protobuf, it will configure CMake to download and try building that version.
    • CMAKE_TOOLCHAIN_FILE: The path to a toolchain file for cross compilation.
    • BUILD_PARSERS: Specify if the parsers should be built, for example [ON] | OFF. If turned OFF, CMake will try to find precompiled versions of the parser libraries to use in compiling samples. First in ${TRT_LIB_DIR}, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available.
    • BUILD_PLUGINS: Specify if the plugins should be built, for example [ON] | OFF. If turned OFF, CMake will try to find a precompiled version of the plugin library to use in compiling samples. First in ${TRT_LIB_DIR}, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available.
    • BUILD_SAMPLES: Specify if the samples should be built, for example [ON] | OFF.
    • GPU_ARCHS: GPU (SM) architectures to target. By default we generate CUDA code for all major SMs. Specific SM versions can be specified here as a quoted, space-separated list to reduce compilation time and binary size. A table of compute capabilities of NVIDIA GPUs can be found here. Examples:
      • NVIDIA A100: -DGPU_ARCHS="80"
      • Tesla T4, GeForce RTX 2080: -DGPU_ARCHS="75"
      • Titan V, Tesla V100: -DGPU_ARCHS="70"
      • Multiple SMs: -DGPU_ARCHS="80 75"
    • TRT_PLATFORM_ID: Bare-metal builds (unlike containerized cross-compilation) on non-Linux/x86 platforms must explicitly specify the target platform. Currently supported options: x86_64 (default), aarch64
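
    As a rough illustration, several of these optional arguments can be combined in one configure step; the values below (Release build, SMs 75 and 80 only, all components enabled) are placeholders to adapt to your setup:

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out \
              -DCMAKE_BUILD_TYPE=Release -DGPU_ARCHS="75 80" \
              -DBUILD_PLUGINS=ON -DBUILD_PARSERS=ON -DBUILD_SAMPLES=ON
     make -j$(nproc)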

References

TensorRT Resources

Known Issues

Comments
  • How to use NMS with Pytorch model (that was converted to ONNX -> TensorRT)

    All right, so, I have a PyTorch SSD detector with MobileNet. Since I failed to convert the model with NMS in it (to be more precise, I converted it, but the TRT engine is built incorrectly from that .onnx file), I decided to leave the NMS part to TRT.

    In general, there are several ways to add NMS in TRT:

    1. Use graphsurgeon with TensorFlow model and add NMS as graphsurgeon.create_plugin_node
    2. Use CPP code for plugin (https://github.com/NVIDIA/TensorRT/tree/master/plugin/batchedNMSPlugin)
    3. Use DeepStream that has NMS plugin

    But, I have a PyTorch model that I converted to onnx and then to TRT without any CPP code (Python only). My question is very simple: how can I combine my current pipeline with the CPP plugin for NMS?

    Component: Plugins good-reference triaged 
    opened by ivanpanshin 83
  • [08/18/2019-20:58:10] [E] [TRT] UffParser: Validator error: mrcnn_mask_deconv/add_1: Unsupported operation _AddV2 &&&& FAILED TensorRT.sample_maskrcnn # ./sample_uff_maskRCNN -d

    [08/18/2019-20:58:10] [E] [TRT] UffParser: Validator error: mrcnn_mask_deconv/add_1: Unsupported operation _AddV2
    &&&& FAILED TensorRT.sample_maskrcnn # ./sample_uff_maskRCNN -d ~/data
    

    ubuntu16.04 TensorRT 6.x (build source from git branch release/6.0) following tutorial converts matterport maskrcnn model successfully to uff, inference got this result.

    Component: UFF unsupported-op Samples 
    opened by jinfagang 52
  • tensort7 load onnx resize ops error

    Description

    When I load the onnx model, the FPN F.interpolate ops error out.


    While parsing node number 209 [Resize]: ERROR: builtin_op_importers.cpp:2412 In function importResize: [8] Assertion failed: scales.is_weights() && "Resize scales must be an initializer!"


    this error in onnx-tensorrt

    Environment

    TensorRT Version: 7.0
    GPU Type: 1060
    Nvidia Driver Version: 441.22
    CUDA Version: 10.2
    CUDNN Version: 7.6.5.32
    Operating System + Version: win10

    OS: Windows Component: ONNX Framework: PyTorch good-reference Release: 7.x triaged 
    opened by syshensyshen 51
  • how to create an engine serve for multiple source inputs?

    How can I create one engine (e.g., one TensorRT detector engine) that can serve 6 or 10 cameras for object detection, using threading, without the outputs from these sources getting mixed up?

    question ask-the-experts triaged 
    opened by HoangTienDuc 45
  • How to add NMS with Tensorflow Model (that was converted to ONNX)

    I have taken an ssdlite mobile net v2 model from the tensorflow model zoo

    steps :

    1. generated the onnx model using the tf2onnx lib:

       python -m tf2onnx.convert --graphdef mv2/ssdlite_mobilenet_v2_coco_2018_05_09/frozen_inference_graph.pb \
           --output MODEL_frozen.onnx \
           --fold_const --opset 11 \
           --inputs image_tensor:0 \
           --outputs num_detections:0,detection_boxes:0,detection_scores:0,detection_classes:0

    2. add the NMS layers in the onnx model based on references from this issue

    import onnx_graphsurgeon as gs
    import onnx
    import numpy as np
    
    input_model_path = "MODEL_frozen.onnx"
    output_model_path = "model_gs.onnx"
    
    @gs.Graph.register()
    def trt_batched_nms(self, boxes_input, scores_input, nms_output,
                        share_location, num_classes):
    
        boxes_input.outputs.clear()
        scores_input.outputs.clear()
        nms_output.inputs.clear()
    
        attrs = {
            "shareLocation": share_location,
            "numClasses": num_classes,
            "backgroundLabelId": 0,
            "topK": 116740,
            "keepTopK": 100,
            "scoreThreshold": 0.3,
            "iouThreshold": 0.6,
            "isNormalized": True,
            "clipBoxes": True
        }
        return self.layer(op="BatchedNMS_TRT", attrs=attrs,
                          inputs=[boxes_input, scores_input],
                          outputs=[nms_output])
    
    
    graph = gs.import_onnx(onnx.load(input_model_path))
    graph.inputs[0].shape=[1,300,300,3]
    print(graph.inputs[0].shape)
    
    for inp in graph.inputs:
        inp.dtype = np.int
    
    input = graph.inputs[0]
    
    tmap = graph.tensors()
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores/NonMaxSuppressionV5__1761:0"],
                          tmap["NonMaxSuppression__1763:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_1/NonMaxSuppressionV5__1737:0"],
                          tmap["NonMaxSuppression__1739:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1713:0"],
                          tmap["NonMaxSuppression__1715:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_3/NonMaxSuppressionV5__1689:0"],
                          tmap["NonMaxSuppression__1691:0"],
                          share_location=False,
                          num_classes=8)
    
    
    # Remove unused nodes, and topologically sort the graph.
    # graph.cleanup()
    # graph.toposort()
    # graph.fold_constants().cleanup()
    
    # Export the ONNX graph from graphsurgeon
    onnx.checker.check_model(gs.export_onnx(graph))
    onnx.save_model(gs.export_onnx(graph), output_model_path)
    
    print("Saving the ONNX model to {}".format(output_model_path))
    
    

    I am not able to figure out which nodes in the onnx graph I should use in place of "Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0" and the others in these calls:

    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores/NonMaxSuppressionV5__1761:0"],
                          tmap["NonMaxSuppression__1763:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_1/NonMaxSuppressionV5__1737:0"],
                          tmap["NonMaxSuppression__1739:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1713:0"],
                          tmap["NonMaxSuppression__1715:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_3/NonMaxSuppressionV5__1689:0"],
                          tmap["NonMaxSuppression__1691:0"],
                          share_location=False,
                          num_classes=8)
    

    MODEL_frozen.onnx.zip

    I have also attached the onnx file. Any suggestions on how to find it?

    Component: ONNX Topic: ONNX Plugin triaged 
    opened by letdivedeep 43
  • BERT fp16 accuracy problem

    Description

    When using TRT to build an fp16 model, the inference accuracy differs too much from fp32. The model is BERT base. Why?

    Environment

    TensorRT Version: 7.2.1
    NVIDIA GPU: T4
    NVIDIA Driver Version: 440.59
    CUDA Version: 10.2
    CUDNN Version: 8.0.4
    Operating System: centos7
    Python Version (if applicable): 3.6
    Tensorflow Version (if applicable): 1.15.4
    PyTorch Version (if applicable):
    Baremetal or Container (if so, version):

    Steps To Reproduce

    Proceed as follows:

    1. tf (freeze mode) -> onnx (version: 1.8.1) -> trt engine
    2. When building the trt engine, set these parameters:

       with builder.create_builder_config() as config:
           config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
           ...

    3. At the same time, I also tried to set the precision on these layers (such as: LayerNorm/moments/SquaredDifference, intermediate/dense/Erf, pooler/dense/Tanh, query_head_contrastive/Relu and so on):

       network.get_layer(i).precision = trt.DataType.FLOAT

       BUT no effect

    I also found something very strange: when I compared layer0 and layer1, the accuracy is not much different, but at layer2 there is a big difference. This model has 12 layers, and each layer has the same structure.

    Precision: FP16 Release: 7.x triaged 
    opened by chenzhanyiczy 42
  • Onnx Dynamic input to TensorRT

    [TensorRT] INTERNAL ERROR: Assertion failed: aMatrix.second == bMatrix.first ../rtExt/cuda/cudaMatrixMultiplyRunner.cpp:35
    Aborting...
    [TensorRT] ERROR: ../rtExt/cuda/cudaMatrixMultiplyRunner.cpp (35) - Assertion Error in assertDimsOkayForMatrixMultiplyLayer: 0 (aMatrix.second == bMatrix.first)

    Topic: Dynamic Shape triaged Runtime: Error 
    opened by BarryKCL 41
  • ONNX networks can't use INT8 calibration and batching

    Description

    This is due to mutually incompatible changes in the TRT7 release:

    https://docs.nvidia.com/deeplearning/sdk/tensorrt-release-notes/tensorrt-7.html

    ONNX parser with dynamic shapes support The ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set.

    versus

    Known Issues The INT8 calibration does not work with dynamic shapes. To workaround this issue, ensure there are two passes in the code: Using a fixed shape input to build the engine in the first pass, allows TensorRT to generate the calibration cache.

    This means the ONNX network must be exported at a fixed batch size in order to get INT8 calibration working, but now it's no longer possible to specify the batch size. I also verified that manually fixing up the inputs with setDimensions(...-1...) does not work, you will hit an assertion mg.nodes[mg.regionIndices[outputRegion]].size ==mg.nodes[mg.regionIndices[inputRegion]].size while building.

    One would think there might be sort of a workaround by exporting two different networks, one with a fixed batch size and a second one with a dynamic_axis, and then using the calibration from one for the other. ~~However, even here there are severe pitfalls: a calibration cache that is generated for, say, batch_size=1 won't necessarily work for larger batch sizes, presumably because they will generate a different convolution strategy that causes different accuracy issues.~~ Edit: This might've been another issue.

    Lastly, the calibrator itself appears to be using implicit batch sizes, and breaks on batch size > 1 as follows:

    TRT: Starting Calibration with batch size 16.
    Calibrated 16 images.
    TRT: Explicit batch network detected and batch size specified, use execute without batch size instead.
    TRT: C:\source\builder\cudnnCalibrator.cpp (707) - Cuda Error in nvinfer1::builder::Histogram::add: 700 (an illegal memory access was encountered)
    TRT: FAILED_ALLOCATION: Unknown exception
    TRT: C:\source\builder\cudnnCalibrator.cpp (703) - Cuda Error in nvinfer1::builder::Histogram::add: 700 (an illegal memory access was encountered)
    TRT: FAILED_ALLOCATION: Unknown exception
    TRT: C:\source\rtSafe\cuda\caskConvolutionRunner.cpp (233) - Cuda Error in nvinfer1::rt::task::CaskConvolutionRunner::allocateContextResources: 700 (an illegal memory access was encountered)
    TRT: FAILED_EXECUTION: Unknown exception
    TRT: Calibrated batch 0 in 2.62865 seconds.
    Cuda failure: 700

    with batch_size == 1, it's also hitting assertions:

    TRT: Explicit batch network detected and batch size specified, use execute without batch size instead.
    TRT: Assertion failed: d.nbDims >= 1
    C:\source\rtSafe\safeHelpers.cpp:419
    Aborting...

    The combination of all these failures means that you can't really use ONNX networks in INT8 mode, at least the "Using a fixed shape input to build the engine in the first pass" recommendation hits all kinds of internal assertions as you can see above.

    Environment

    TensorRT Version: 7.0.0.11
    GPU Type: RTX 2080
    Nvidia Driver Version: 441.22
    CUDA Version: 10.2
    CUDNN Version: 7.6.0.5
    Operating System + Version: Windows 10
    Python Version (if applicable): 3.6
    TensorFlow Version (if applicable):
    PyTorch Version (if applicable): 1.3 stable
    Baremetal or Container (if container which image + tag): bare

    Relevant Files

    Steps To Reproduce

    bug Component: ONNX Precision: INT8 Release: 7.x 
    opened by gcp 38
  • trt sampleUffMaskrcnn has a different result with maskrcnn implemented in keras

    Hi, I have used sampleUffMaskRCNN on my own dataset; it worked, but the results are different. The result of the TRT sampleUffMaskRCNN depends a lot on the anchor scales and anchor ratios, and I set the same params in both test codes. The Keras one performs better, as it can show more objects (instances), but some objects in the TRT Mask R-CNN can't be detected, especially slender objects, like a pole. Thanks for the help.

    opened by hwh-hit 37
  • [REFERENCE] KeyError: 'mrcnn_mask_bn4/batchnorm/mul_1' in running sampleUffMaskRCNN demo

    while I try to run the maskrcnn demo following this page

    Ubuntu 16.04.6
    CUDA 10.1.168
    tensorrt 5.1.5.0
    uff 0.6.3

    Traceback (most recent call last):
      File "mrcnn_to_trt_single.py", line 165, in <module>
        main()
      File "mrcnn_to_trt_single.py", line 123, in main
        text=True, list_nodes=list_nodes)
      File "mrcnn_to_trt_single.py", line 158, in convert_model
        debug_mode = False
      File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/conversion_helpers.py", line 233, in from_tensorflow_frozen_model
        return from_tensorflow(graphdef, output_nodes, preprocessor, **kwargs)
      File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/conversion_helpers.py", line 108, in from_tensorflow
        pre.preprocess(dynamic_graph)
      File "./config.py", line 123, in preprocess
        connect(dynamic_graph, timedistributed_connect_pairs)
      File "./config.py", line 113, in connect
        if node_a_name not in dynamic_graph.node_map[node_b_name].input:
    KeyError: 'mrcnn_mask_bn4/batchnorm/mul_1'
    
    Component: UFF Samples Release: 6.x Framework: TensorFlow Conversion: UFF good-reference 
    opened by seanyuner 36
  • (Upsample) How can I use onnx parser with opset 11 ?

    Description

    onnx-parser is basically built with ir_version 3, opset 7 (https://github.com/onnx/onnx-tensorrt/blob/master/onnx_trt_backend.cpp)

    Is there any way to use onnx parser with opset 11 support ?

    I mean, the parser works only with opset 7. The parser works well if I use an ir4_opset7 onnx model, but doesn't work if I use an ir4_opset11 onnx model.

    It also cannot parse opset 8 and 9.

    My onnx models are made by pytorch 1.4.0a.

    Can I rebuild the parser by changing only the BACKEND_OPSET constant inside onnx_trt_backend.cpp?

    Environment

    TensorRT Version: 7.0.0
    GPU Type: T4
    Nvidia Driver Version: 440.33.01
    CUDA Version: 10.2.89
    CUDNN Version: 7.6.5
    Operating System + Version: Ubuntu18.04
    Python Version (if applicable): 3.6.9
    TensorFlow Version (if applicable): 1.4.0
    PyTorch Version (if applicable): 1.4.0a

    API: Python Framework: PyTorch Conversion: torch.onnx Release: 7.x TODO 
    opened by dhkim0225 34
  • Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"

    Description

    Environment

    TensorRT Version: trt7
    NVIDIA GPU: 3060
    NVIDIA Driver Version: 515.57
    CUDA Version: 11.7
    CUDNN Version:
    Operating System: ubuntu18.04
    Python Version (if applicable): 3.9
    Tensorflow Version (if applicable): None
    PyTorch Version (if applicable): 1.10
    Baremetal or Container (if so, version):

    Relevant Files

    Hello, I want to use the "pytorch to onnx to tensorrt" pipeline to convert an op that is unsupported by both onnx and trt. So I use a fake onnx op to complete the process from pytorch to onnx. After this, the onnx model looks like this: image

    After this, I copy the msda from trt8 as a plugin into trt7. When I convert the model from onnx to tensorrt, an error occurs: image

    Then I checked the pluginType, pluginVersion and pluginName; they are correct: image image

    And in the CMakeLists I have: image

    So what can I do to solve this problem?

    Steps To Reproduce

    opened by liuxubit 1
  • Error with multiple optimization profiles when fetching runtime dimensions. Assertion slots.size() >= static_cast<size_t>(code.nbSlots) failed. insufficient number of slots provided.

    Description

    After compiling our model with multiple optimization profiles, we create multiple execution contexts for each profile for inference. When using one of the contexts and trying to call context.get_tensor_shape to get the shape of our output, after we have set our input shape with context.set_input_shape, we see the following error:

    Error Code 2: Internal Error (Assertion slots.size() >= static_cast<size_t>(code.nbSlots) failed. insufficient number of slots provided)
    (0) ## this is the dimension printed.
    

    However, when compiling our model with one optimization profile and running the same code, we do not see any issues as the output dimensions are printed properly.

    Could someone explain the error message a bit more and suggest a path toward resolution?

    Environment

    TensorRT Version: 8.5.1.7
    NVIDIA GPU: Tesla T4
    NVIDIA Driver Version: 510.73.08
    CUDA Version: 11.6
    CUDNN Version: 8.6
    Operating System:
    Python Version (if applicable):
    Tensorflow Version (if applicable):
    PyTorch Version (if applicable): 1.10
    Baremetal or Container (if so, version):

    Relevant Files

    Steps To Reproduce

    1. deserialized engine: engine = runtime.deserialize_cuda_engine(serialized_engine)
    2. Create execution context: context = engine.create_execution_context()
    3. Set our input shape and print dimensions as follows
        in0 = engine.get_tensor_name(0)
        in1 = engine.get_tensor_name(1)
        out1 = engine.get_tensor_name(2)
        out2 = engine.get_tensor_name(3)
    
        context.set_input_shape(in0, (1, 10))
        context.set_input_shape(in1, (1,))
        print(context.get_tensor_shape(out1))
    
    opened by gkang2018 0
  • Can not convert auto-regression model to TensorRT engine

    TensorRT Version: 8.2.3.0

    When I convert an auto-regression model with a while_loop operator to a tensorrt engine with trtexec, it gives the following error:

    [01/03/2023-08:43:00] [E] [TRT] parsers/onnx/ModelImporter.cpp:783: --- End node ---
    [01/03/2023-08:43:00] [E] [TRT] parsers/onnx/ModelImporter.cpp:785: ERROR: parsers/onnx/ModelImporter.cpp:166 In function parseGraph:
    [6] Invalid Node - generic_loop_Loop__352
    [graphShapeAnalyzer.cpp::processCheck::581] Error Code 4: Internal Error ((Unnamed Layer* 3582) [Recurrence]: inputs to IRecurrenceLayer have different dimensions. First input has dimensions [3,1] and second input has dimensions [3,2]. )
    [graphShapeAnalyzer.cpp::processCheck::581] Error Code 4: Internal Error ((Unnamed Layer* 3582) [Recurrence]: inputs to IRecurrenceLayer have different dimensions. First input has dimensions [3,1] and second input has dimensions [3,2]. )
    [01/03/2023-08:43:00] [E] Failed to parse onnx file
    [01/03/2023-08:43:00] [E] Parsing model failed
    [01/03/2023-08:43:00] [E] Failed to create engine from model.
    [01/03/2023-08:43:00] [E] Engine set up failed

    It seems that the input to the while_loop must have a constant shape? How can I solve this problem?

    opened by yjiangling 0
  • stable diffusion demo ,run error

    Description

    When running demo-diffusion.py, I met an error.

    Environment

    Used the provided docker image, nvcr.io/nvidia/tensorrt:22.10-py3

    TensorRT Version: 8.5.0.12
    NVIDIA GPU: V100
    NVIDIA Driver Version: 515.43.04
    CUDA Version: 11.8
    CUDNN Version: None
    Operating System: Ubuntu
    Python Version (if applicable): 3.8.10
    Tensorflow Version (if applicable):
    PyTorch Version (if applicable): 1.12.0+cu116
    Baremetal or Container (if so, version):

    Relevant Files

    [I] Total Nodes | Original: 1251, After Folding: 1078 | 173 Nodes Folded
    [I] Folding Constants | Pass 3
    [I] Total Nodes | Original: 1078, After Folding: 1078 | 0 Nodes Folded
    CLIP: fold constants .. 1078 nodes, 1812 tensors, 1 inputs, 1 outputs
    CLIP: shape inference .. 1078 nodes, 1812 tensors, 1 inputs, 1 outputs
    CLIP: removed 12 casts .. 1054 nodes, 1788 tensors, 1 inputs, 1 outputs
    CLIP: inserted 25 LayerNorm plugins .. 842 nodes, 1526 tensors, 1 inputs, 1 outputs
    CLIP: final .. 842 nodes, 1526 tensors, 1 inputs, 1 outputs
    Building TensorRT engine for onnx/clip.opt.onnx: engine/clip.plan
    [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
    [W] parsers/onnx/onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
    [E] parsers/onnx/ModelImporter.cpp:740: While parsing node number 7 [LayerNorm -> "LayerNormV-0"]:
    [E] parsers/onnx/ModelImporter.cpp:741: --- Begin node ---
    [E] parsers/onnx/ModelImporter.cpp:742: input: "input.7" input: "LayerNormGamma-0" input: "LayerNormBeta-0" output: "LayerNormV-0" name: "LayerNormN-0" op_type: "LayerNorm" attribute { name: "epsilon" f: 1e-05 type: FLOAT }
    [E] parsers/onnx/ModelImporter.cpp:743: --- End node ---
    [E] parsers/onnx/ModelImporter.cpp:745: ERROR: parsers/onnx/builtin_op_importers.cpp:5365 In function importFallbackPluginImporter: [8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
    [E] In node 7 (importFallbackPluginImporter): UNSUPPORTED_NODE: Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
    [!] Could not parse ONNX correctly
    Traceback (most recent call last):
      File "demo-diffusion.py", line 482, in
        demo.loadEngines(args.engine_dir, args.onnx_dir, args.onnx_opset,
      File "demo-diffusion.py", line 241, in loadEngines
        engine.build(onnx_opt_path, fp16=True,
      File "/workspace/demo/Diffusion/utilities.py", line 72, in build
        engine = engine_from_network(network_from_onnx_path(onnx_path), config=CreateConfig(fp16=fp16, profiles=[p],
      File "", line 3, in func_impl
      File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/base/loader.py", line 42, in call
        return self.call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py", line 183, in call_impl
        trt_util.check_onnx_parser_errors(parser, success)
      File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/util.py", line 85, in check_onnx_parser_errors
        G_LOGGER.critical("Could not parse ONNX correctly")
      File "/usr/local/lib/python3.8/dist-packages/polygraphy/logger/logger.py", line 597, in critical
        raise PolygraphyException(message) from None
    polygraphy.exception.exception.PolygraphyException: Could not parse ONNX correctly

    triaged Demo: Diffusion 
    opened by xiaohaipeng 2
  • [shuffleBuilder.cpp::addSupportedFormats::50] Error Code 2: Internal Error (Assertion formats.nbInputs() == 1 || formats.nbInputs() == 2 failed.)

    Description

    My model is trained in pytorch, then quantized using pytorch-quantization from tensorrt/tools, then exported to onnx, and then the engine is built from onnx.

    The error in the title occurred when I parsed the onnx model to build the engine on my Jetson Xavier NX (I tested this pipeline on a GPU with TensorRT 8.5, and no error occurred). This is the log context:

    [TensorRT] VERBOSE: Eliminating concatenation node_of_outputs_coords
    [TensorRT] VERBOSE: Generating copy for 15813 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: Generating copy for 15815 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: Generating copy for 15817 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: Generating copy for 15819 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: Generating copy for 15821 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: Generating copy for 15823 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: After concat removal: 3085 layers
    [TensorRT] VERBOSE: Graph construction and optimization completed in 220.024 seconds.
    [TensorRT] INFO: ---------- Layers Running on DLA ----------
    [TensorRT] INFO: ---------- Layers Running on GPU ----------
    [TensorRT] INFO: [GpuLayer] node_of_1001_quantize_scale_node
    [TensorRT] INFO: [GpuLayer] node_of_inputs
    ...## (too many layers of output omitted, more than 3000 layers)
    [TensorRT] INFO: [GpuLayer] node_of_14036
    [TensorRT] INFO: [GpuLayer] node_of_15819
    [TensorRT] INFO: [GpuLayer] node_of_13155
    [TensorRT] INFO: [GpuLayer] node_of_15817
    [TensorRT] INFO: [GpuLayer] node_of_12274
    [TensorRT] INFO: [GpuLayer] node_of_15815
    [TensorRT] INFO: [GpuLayer] 15813 copy
    [TensorRT] INFO: [GpuLayer] 15815 copy
    [TensorRT] INFO: [GpuLayer] 15817 copy
    [TensorRT] INFO: [GpuLayer] 15819 copy
    [TensorRT] INFO: [GpuLayer] 15821 copy
    [TensorRT] INFO: [GpuLayer] 15823 copy
    [TensorRT] VERBOSE: Using cublas a tactic source
    [TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +227, GPU +209, now: CPU 978, GPU 4493 (MiB)
    [TensorRT] VERBOSE: Using cuDNN as a tactic source
    [TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +307, GPU +306, now: CPU 1285, GPU 4799 (MiB)
    [TensorRT] WARNING: Detected invalid timing cache, setup a local cache instead
    [TensorRT] VERBOSE: Constructing optimization profile number 0 [1/1].
    [TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1285, GPU 4811 (MiB)
    [TensorRT] ERROR: 2: [shuffleBuilder.cpp::addSupportedFormats::50] Error Code 2: Internal Error (Assertion formats.nbInputs() == 1 || formats.nbInputs() == 2 failed.)
    [TensorRT] ERROR: 2: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
    

    This error seems to be an internal error reported by tensorrt. I searched for this error on Google, yet found no meaningful information. I want to know which layer causes this error, so that I can substitute some ops, but I cannot understand what this error means; this is where I need help.

    Environment

    TensorRT Version: 8.0.1.6
    NVIDIA GPU: jetson
    NVIDIA Driver Version: jetpack 5.0
    CUDA Version: 10.2
    Operating System: Ubuntu 18.04.5 LTS
    Python Version (if applicable): 3.6.9
    PyTorch Version (if applicable): 1.11.0a0+17540c5
    Baremetal or Container (if so, version):

    Relevant Files

    Steps To Reproduce

    opened by pupumao 0
Releases(8.5.2)
  • 8.5.2(Dec 14, 2022)

  • 22.12(Dec 8, 2022)

    Commit used by the 22.12 TensorRT NGC container.

    Added

    • Stable Diffusion demo using TensorRT Plugins
    • KV-cache and beam search to GPT2 and T5 demos
    • Perplexity calculation to all HF demos

    Changed

    • Updated trex to v0.1.5
    • Increased default workspace size in demoBERT to build BS=128 fp32 engines
    • Use avg_iter=8 and timing cache to make demoBERT perf more stable

    Removed

    • None
  • 8.5.1(Nov 2, 2022)

    TensorRT OSS release corresponding to TensorRT 8.5.1.7 GA release.

    Key Features and Updates:

    • Samples enhancements

      • Added sampleNamedDimensions which works with named dimensions.
      • Updated sampleINT8API and introductory_parser_samples to use ONNX models over Caffe/UFF
      • Removed UFF/Caffe samples including sampleMNIST, end_to_end_tensorflow_mnist, sampleINT8, sampleMNISTAPI, sampleUffMNIST, sampleUffPluginV2Ext, engine_refit_mnist, int8_caffe_mnist, uff_custom_plugin, sampleFasterRCNN, sampleUffFasterRCNN, sampleGoogleNet, sampleSSD, sampleUffSSD, sampleUffMaskRCNN and uff_ssd.
    • Plugin enhancements

      • Added GridAnchorRectPlugin to support rectangular feature maps in gridAnchorPlugin.
      • Added ROIAlignPlugin to support the ONNX operator RoiAlign. The ONNX parser will automatically route ROIAlign ops through the plugin.
      • Added Hopper support for the BERTQKVToContextPlugin plugin.
      • Exposed the use_int8_scale_max attribute in the BERTQKVToContextPlugin plugin to allow users to disable the by-default usage of INT8 scale factors to optimize softmax MAX reduction in versions 2 and 3 of the plugin.
    • ONNX-TensorRT changes

    • Build containers

      • Updated default cuda versions to 11.8.0.
    • Tooling enhancements

  • 8.4.3(Aug 19, 2022)

  • 22.08(Aug 17, 2022)

    Commit used by the 22.08 TensorRT NGC container.

    Changelog

    Updated TensorRT version to 8.4.2 - see the TensorRT 8.4.2 release notes for more information

    Changed

    • Updated default protobuf version to 3.20.x
    • Updated ONNX-TensorRT submodule version to 22.08 tag
    • Updated sampleIOFormats and sampleAlgorithmSelector to use ONNX models over Caffe

    Fixes

    • Fixed missing serialization member in CustomClipPlugin plugin
    • Fixed various Python import issues

    Added

    • Added new DeBERTA demo
    • Added version 2 for disentangledAttentionPlugin to support DeBERTA v2

    Removed

    • None
  • 22.07(Jul 22, 2022)

    Commit used by the 22.07 TensorRT NGC container.

    Changelog

    Added

    • polygraphy-trtexec-plugin tool for Polygraphy
    • Multi-profile support for demoBERT
    • KV cache support for HF BART demo

    Changed

    • Updated ONNX-GS to v0.3.20

    Removed

    • None
  • 8.4.1(Jun 14, 2022)

    TensorRT OSS release corresponding to TensorRT 8.4.1.5 GA release.

    Key Features and Updates:

    • Samples enhancements

    • EfficientDet sample

      • Added support for EfficientDet Lite and AdvProp models.
      • Added dynamic batch support.
      • Added mixed precision engine builder.
    • HuggingFace transformer demo

      • Added BART model.
      • Performance speedup of GPT-2 greedy search using GPU implementation.
      • Fixed GPT2 onnx export failure due to 2G file size limitation.
      • Extended Megatron LayerNorm plugins to support larger hidden sizes.
      • Added performance benchmarking mode.
      • Enable tf32 format by default.
    • demoBERT enhancements

      • Add --duration flag to perf benchmarking script.
      • Fixed import of nvinfer_plugins library in demoBERT on Windows.
    • Torch-QAT toolkit

      • quant_bert.py module removed. It is now upstreamed to HuggingFace QDQBERT.
      • Use axis0 as default for deconv.
      • #1939 - Fixed path in classification_flow example.
    • Plugin enhancements

    • Build containers

      • Updated default cuda versions to 11.6.2.
      • CentOS Linux 8 has reached End-of-Life on Dec 31, 2021. The corresponding container has been removed from TensorRT-OSS.
      • Install devtoolset-8 for updated g++ versions in CentOS7 container.
    • Tooling enhancements

    • trtexec enhancements

      • Added --layerPrecisions and --layerOutputTypes flags for specifying layer-wise precision and output type constraints.
      • Added --memPoolSize flag to specify the size of workspace as well as the DLA memory pools via a unified interface. Correspondingly the --workspace flag has been deprecated.
      • "End-To-End Host Latency" metric has been removed. Use the “Host Latency” metric instead. For more information, refer to Benchmarking Network section in the TensorRT Developer Guide.
      • Use enqueueV2() instead of enqueue() when engine has explicit batch dimensions.
  • 22.06(Jun 9, 2022)

    Commit used by the 22.06 TensorRT NGC container.

    Changelog

    Added

    • None

    Changed

    • Disentangled attention (DMHA) plugin refactored
    • ONNX parser updated to 8.2GA

    Removed

    • None
  • 22.05(May 13, 2022)

    Commit used by the 22.05 TensorRT NGC container.

    Changelog

    Added

    • Disentangled attention plugin for DeBERTa
    • DMHA (multiscaleDeformableAttnPlugin) plugin for DDETR
    • Performance benchmarking mode to HuggingFace demo

    Changed

    • Updated base TensorRT version to 8.2.5.1
    • Updated onnx-graphsurgeon v0.3.19 CHANGELOG
    • fp16 support for pillarScatterPlugin
    • #1939 - Fixed path in quantization classification_flow
    • Fixed GPT2 onnx export failure due to 2G limitation
    • Use axis0 as default for deconv in pytorch-quantization toolkit
    • Updated onnx export script for CoordConvAC sample
    • Install devtoolset-8 for updated g++ version in CentOS7 container

    Removed

    • Usage of deprecated TensorRT APIs in samples removed
    • quant_bert.py module removed from pytorch-quantization
  • 22.04(Apr 14, 2022)

    Commit used by the 22.04 TensorRT NGC container.

    Changelog

    Added

    • TensorRT Engine Explorer v0.1.0 README
    • Detectron 2 Mask R-CNN R50-FPN python sample
    • Model export script for sampleOnnxMnistCoordConvAC

    Changed

    • Updated base TensorRT version to 8.2.4.2
    • Updated copyright headers with SPDX identifiers
    • Updated onnx-graphsurgeon v0.3.17 CHANGELOG
    • PyramidROIAlign plugin refactor and bug fixes
    • Fixed MultilevelCropAndResize crashes on Windows
    • #1583 - sublicense ieee/half.h under Apache2
    • Updated demo/BERT performance tables for rel-8.2
    • #1774 Fix python hangs at IndexErrors when TF is imported after TensorRT
    • Various bugfixes in demos - BERT, Tacotron2 and HuggingFace GPT/T5 notebooks
    • Cleaned up sample READMEs

    Removed

    • sampleNMT removed from samples
  • 22.03(Mar 24, 2022)

    Commit used by the 22.03 TensorRT NGC container.

    Changelog

    Added

    • EfficientDet sample enhancements
      • Added support for EfficientDet Lite and AdvProp models.
      • Added dynamic batch support.
      • Added mixed precision engine builder.

    Changed

    • Better decoupling of HuggingFace demo tests
  • 22.02(Feb 4, 2022)

    Commit used by the 22.02 TensorRT NGC container.

    Changelog

    Added

    Changed

    • Extend Megatron LayerNorm plugins to support larger hidden sizes
    • Refactored EfficientNMS plugin for TFTRT and added implicit batch mode support
    • Update base TensorRT version to 8.2.3.0
    • GPT-2 greedy search speedup - now runs on GPU
    • Updates to TensorRT developer tools
    • Updated ONNX parser to v8.2.3.0
    • Minor updates and bugfixes
      • Samples: TFOD, GPT-2, demo/BERT
      • Plugins: proposalPlugin, geluPlugin, bertQKVToContextPlugin, batchedNMS

    Removed

    • Unused source file(s) in demo/BERT
  • 22.01(Jan 24, 2022)

  • 8.2.1(Nov 24, 2021)

    TensorRT OSS release corresponding to TensorRT 8.2.1.8 GA release.

    • Updates since TensorRT 8.2.0 EA release.

    • Please refer to the TensorRT 8.2.1 GA release notes for more information.

    • ONNX parser v8.2.1

      • Removed duplicate constant layer checks that caused some performance regressions
      • Fixed expand dynamic shape calculations
      • Added parser-side checks for Scatter layer support
    • Sample updates

      • Added Tensorflow Object Detection API converter samples, including Single Shot Detector, Faster R-CNN and Mask R-CNN models
      • Multiple enhancements in HuggingFace transformer demos
        • Added multi-batch support
        • Fixed resultant performance regression in batchsize=1
        • Fixed T5 large/T5-3B accuracy issues
        • Added notebooks for T5 and GPT-2
        • Added CPU benchmarking option
      • Deprecated kSTRICT_TYPES (strict type constraints). Equivalent behaviour now achieved by setting PREFER_PRECISION_CONSTRAINTS, DIRECT_IO, and REJECT_EMPTY_ALGORITHMS
      • Removed sampleMovieLens
      • Renamed sampleReformatFreeIO to sampleIOFormats
      • Add idleTime option for samples to control qps
      • Specify default value for precisionConstraints
      • Fixed reporting of TensorRT build version in trtexec
      • Fixed combineDescriptions typo in trtexec/tracer.py
      • Fixed usages of kDIRECT_IO
    • Plugin updates

      • EfficientNMS plugin support extended to TF-TRT, and for clang builds.
      • Sanitize header definitions for BERT fused MHA plugin
      • Separate C++ and cu files in splitPlugin to avoid PTX generation (required for CUDA enhanced compatibility support)
      • Enable C++14 build for plugins
    • ONNX tooling updates

    • Build and container fixes

      • Add SM86 target to default GPU_ARCHS for platforms with cuda-11.1+
      • Remove deprecated SM_35 and add SM_60 to default GPU_ARCHS
      • Skip CUB builds for cuda 11.0+ #1455
      • Fixed cuda-10.2 container build failures in Ubuntu 20.04
      • Add native ARM server build container
      • Install devtoolset-8 for updated g++ version in CentOS7
      • Added a note on supporting c++14 builds for CentOS7
      • Fixed docker build for large UIDs #1373
      • Updated README instructions for Jetpack builds
    • demo enhancements

      • Updated Tacotron2 instructions and add CPU benchmarking
      • Fixed issues in demoBERT python notebook
    • Documentation updates

      • Updated Python documentation for add_reduce, add_top_k, and ISoftMaxLayer
      • Renamed default GitHub branch to main and updated hyperlinks
  • 21.10(Oct 5, 2021)

    Commit used by the 21.10 TensorRT NGC container.

    Changelog

    Added

    • Benchmark script for demoBERT-Megatron
    • Dynamic Input Shape support for EfficientNMS plugin
    • Support empty dimensions in ONNX
    • INT32 and dynamic clips through elementwise in ONNX parser

    Changed

    • Bump TensorRT version to 8.0.3.4
    • Use static shape for only single batch single sequence input in demo/BERT
    • Revert to using native FC layer in demo/BERT and FCPlugin only on older GPUs.
    • Update demo/Tacotron2 for TensorRT 8.0
    • Updates to TensorRT developer tools
      • Polygraphy v0.33.0
        • Added various examples, a CLI User Guide and how-to guides.
        • Added experimental support for DLA.
        • Added a data to-input tool that can combine inputs/outputs created by --save-inputs/--save-outputs.
        • Added a PluginRefRunner which provides CPU reference implementations for TensorRT plugins
        • Made several performance improvements in the Polygraphy CUDA wrapper.
        • Removed the to-json tool which was used to convert Pickled data generated by Polygraphy 0.26.1 and older to JSON.
      • Bugfixes and documentation updates in pytorch-quantization toolkit.
    • Bumped up package versions: tensorflow-gpu 2.5.1, pillow 8.3.2
    • ONNX parser enhancements and bugfixes
      • Update ONNX submodule to v1.8.0
      • Update convDeconvMultiInput function to properly handle deconvs
      • Update RNN documentation
      • Update QDQ axis assertion
      • Fix bidirectional activation alpha and beta values
      • Fix opset10 Resize
      • Fix shape tensor unsqueeze
      • Mark BOOL tiles as unsupported
      • Remove unnecessary shape tensor checks

    Removed

    • N/A
  • 8.2.0-EA(Oct 5, 2021)

    TensorRT OSS release corresponding to TensorRT 8.2.0.6 EA release.

    Added

    • Demo applications showcasing TensorRT inference of HuggingFace Transformers.
      • Support is currently extended to GPT-2 and T5 models.
    • Added support for the following ONNX operators:
      • Einsum
      • IsNan
      • GatherND
      • Scatter
      • ScatterElements
      • ScatterND
      • Sign
      • Round
    • Added support for building TensorRT Python API on Windows.

    Updated

    • Notable API updates in TensorRT 8.2.0.6 EA release. See TensorRT Developer Guide for details.
      • Added three new APIs, IExecutionContext: getEnqueueEmitsProfile(), setEnqueueEmitsProfile(), and reportToProfiler() which can be used to collect layer profiling info when the inference is launched as a CUDA graph.
      • Eliminated the global logger; each Runtime, Builder or Refitter now has its own logger.
      • Added new operators: IAssertionLayer, IConditionLayer, IEinsumLayer, IIfConditionalBoundaryLayer, IIfConditionalOutputLayer, IIfConditionalInputLayer, and IScatterLayer.
      • Added new IGatherLayer modes: kELEMENT and kND
      • Added new ISliceLayer modes: kFILL, kCLAMP, and kREFLECT
      • Added new IUnaryLayer operators: kSIGN and kROUND
      • Added new runtime class IEngineInspector that can be used to inspect the detailed information of an engine, including the layer parameters, the chosen tactics, the precision used, etc.
      • ProfilingVerbosity enums have been updated to show their functionality more explicitly.
    • Updated TensorRT OSS container defaults to cuda 11.4
    • CMake to target C++14 builds.
    • Updated following ONNX operators:
      • Gather and GatherElements implementations to natively support negative indices
      • Pad layer to support ND padding, along with edge and reflect padding mode support
      • If layer with general performance improvements.

    Removed

    • Removed sampleMLP.
    • Several flags of trtexec have been deprecated:
      • --explicitBatch flag has been deprecated and has no effect. When the input model is in UFF or in Caffe prototxt format, the implicit batch dimension mode is used automatically; when the input model is in ONNX format, the explicit batch mode is used automatically.
      • --explicitPrecision flag has been deprecated and has no effect. When the input ONNX model contains Quantization/Dequantization nodes, TensorRT automatically uses explicit precision mode.
      • --nvtxMode=[verbose|default|none] has been deprecated in favor of --profilingVerbosity=[detailed|layer_names_only|none] to show its functionality more explicitly.

    Signed-off-by: Rajeev Rao [email protected]

  • 21.09(Sep 22, 2021)

    Commit used by the 21.09 TensorRT NGC container.

    Changelog

    Added

    • Add ONNX2TRT_VERSION overwrite in CMake.

    Changed

    • Updates to TensorRT developer tools
    • Fix assertion in EfficientNMSPlugin

    Removed

    • N/A
  • 21.08(Aug 5, 2021)

    Commit used by the 21.08 TensorRT NGC container.

    Changelog

    Added

    Changed

    • Updated samples and plugins directory structure
    • Updates to TensorRT developer tools
    • README fix to update build command for native aarch64 builds.

    Removed

    • N/A
  • 21.07(Jul 21, 2021)

  • 8.0.1(Jul 2, 2021)

    TensorRT OSS release corresponding to TensorRT 8.0.1.6 GA release.

    Added

    • Added support for the following ONNX operators: Celu, CumSum, EyeLike, GatherElements, GlobalLpPool, GreaterOrEqual, LessOrEqual, LpNormalization, LpPool, ReverseSequence, and SoftmaxCrossEntropyLoss.
    • Rehauled Resize ONNX operator, now fully supporting the following modes:
      • Coordinate Transformation modes: half_pixel, pytorch_half_pixel, tf_half_pixel_for_nn, asymmetric, and align_corners.
      • Modes: nearest, linear.
      • Nearest Modes: floor, ceil, round_prefer_floor, round_prefer_ceil.
    • Added support for multi-input ONNX ConvTranspose operator.
    • Added support for 3D spatial dimensions in ONNX InstanceNormalization.
    • Added support for generic 2D padding in ONNX.
    • ONNX QuantizeLinear and DequantizeLinear operators leverage IQuantizeLayer and IDequantizeLayer.
      • Added support for tensor scales.
      • Added support for per-axis quantization.
    • Added EfficientNMS_TRT, EfficientNMS_ONNX_TRT plugins and experimental support for ONNX NonMaxSuppression operator.
    • Added ScatterND plugin.
    • Added TensorRT QuickStart Guide.
    • Added new samples: engine_refit_onnx_bidaf builds an engine from ONNX BiDAF model and refits engine with new weights, efficientdet and efficientnet samples for demonstrating Object Detection using TensorRT.
    • Added support for Ubuntu20.04 and RedHat/CentOS 8.3.
    • Added Python 3.9 support.

    Changed

    • Update Polygraphy to v0.30.3.
    • Update ONNX-GraphSurgeon to v0.3.10.
    • Update Pytorch Quantization toolkit to v2.1.0.
    • Notable TensorRT API updates
      • TensorRT now declares APIs with the noexcept keyword. All TensorRT classes that an application inherits from (such as IPluginV2) must guarantee that methods called by TensorRT do not throw uncaught exceptions, or the behavior is undefined.
      • Destructors for classes with destroy() methods were previously protected. They are now public, enabling use of smart pointers for these classes. The destroy() methods are deprecated.
    • Moved RefitMap API from ONNX parser to core TensorRT.
    • Various bugfixes for plugins, samples and ONNX parser.
    • Ported demoBERT to TensorFlow 2 and updated UFF samples to leverage the nvidia-tensorflow1 container.
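
    A minimal sketch of the resulting ownership pattern, assuming TensorRT 8.x headers and any existing nvinfer1::ILogger implementation:

      #include <memory>
      #include "NvInfer.h"

      // Sketch only: with public destructors, TensorRT objects can be owned by
      // standard smart pointers instead of calling the deprecated destroy() methods.
      void buildExample(nvinfer1::ILogger& gLogger)
      {
          std::unique_ptr<nvinfer1::IBuilder> builder{nvinfer1::createInferBuilder(gLogger)};
          std::unique_ptr<nvinfer1::INetworkDefinition> network{builder->createNetworkV2(
              1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH))};
          std::unique_ptr<nvinfer1::IBuilderConfig> config{builder->createBuilderConfig()};
          // ... populate the network and config, then build the engine ...
          // No explicit destroy() calls are needed; unique_ptr releases everything.
      }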

    Removed

    • IPlugin and IPluginFactory interfaces were deprecated in TensorRT 6.0 and have been removed in TensorRT 8.0. We recommend that you write new plugins or refactor existing ones to target the IPluginV2DynamicExt and IPluginV2IOExt interfaces. For more information, refer to Migrating Plugins From TensorRT 6.x Or 7.x To TensorRT 8.x.x.
      • For plugins based on IPluginV2DynamicExt and IPluginV2IOExt, certain methods with legacy function signatures (derived from IPluginV2 and IPluginV2Ext base classes) which were deprecated and marked for removal in TensorRT 8.0 will no longer be available.
    • Removed samplePlugin since it showcased IPluginExt interface, which is no longer supported in TensorRT 8.0.
    • Removed sampleMovieLens and sampleMovieLensMPS.
    • Removed the Dockerfile for Ubuntu 16.04. TensorRT 8.0 Debian packages for Ubuntu 16.04 require Python 3.5, while the minimum Python version required by TensorRT OSS is 3.6.
    • Removed support for PowerPC builds, consistent with TensorRT GA releases.

    Notes

    • We deprecated the Caffe and UFF parsers in TensorRT 7.0. They are still tested and functional in TensorRT 8.0; however, we plan to remove support in a future release. Ensure you migrate your workflow to tf2onnx, keras2onnx, or TensorFlow-TensorRT (TF-TRT).

    Signed-off-by: Rajeev Rao [email protected]

  • 21.06(Jun 23, 2021)

    Commit used by the 21.06 TensorRT NGC container

    Changelog

    Added

    • Add switch for batch-agnostic mode in NMS plugin
    • Add missing model.py in uff_custom_plugin sample

    Changed

    • Update to Polygraphy v0.29.2
    • Update to ONNX-GraphSurgeon v0.3.9
    • Fix numerical errors for float type in NMS/batchedNMS plugins
    • Update demoBERT input dimensions to match Triton requirement #1051
    • Optimize TLT MaskRCNN plugins:
      • Enable FP16 precision in multilevelCropAndResizePlugin and multilevelProposeROIPlugin
      • Algorithm optimizations for the NMS and ROIAlign kernels
      • Fix invalid CUDA launch configuration when the batch size is larger than 32
      • Fix issues found on Jetson Nano

    Removed

    • Removed fcplugin from demoBERT to improve inference latency on GA100/Turing
  • 21.05(May 19, 2021)

    Commit used by the 21.05 TensorRT NGC container

    Changelog

    Added

    • Extended support for ONNX operator InstanceNormalization to 5D tensors
    • Support negative indices in ONNX Gather operator
    • Add support for importing ONNX double-typed weights as float
    • ONNX-GraphSurgeon (v0.3.7) support for models with externally stored weights

    Changed

    • Update ONNX-TensorRT to 21.05
    • Relicense ONNX-TensorRT under Apache2
    • demoBERT builder fixes for multi-batch
    • Speed up the demoBERT build using a global timing cache and disable cuDNN tactics
    • Standardize python package versions across OSS samples
    • Bugfixes in multilevelProposeROI and bertQKV plugin
    • Fix memory leaks in the samples logger
  • 21.04(Apr 12, 2021)

    Commit used by the 21.04 TensorRT NGC container

    Changelog

    Added

    • SM86 kernels for BERT MHA plugin
    • Added opset13 support for SoftMax, LogSoftmax, Squeeze, and Unsqueeze.
    • Added support for the EyeLike and GatherElements operators.

    Changed

    • Updated TensorRT version to v7.2.3.4.
    • Update to ONNX-TensorRT 21.03
    • ONNX-GraphSurgeon (v0.3.4) - updates fold_constants to correctly exit early.
    • Set default CUDA_INSTALL_DIR #798
    • Plugin bugfixes, qkv kernels for sm86
    • Fixed GroupNorm CMakeFile for cu sources #1083
    • Permit groupadd with non-unique GID in build containers #1091
    • Avoid reinterpret_cast #146
    • Clang-format plugins and samples
    • Avoid arithmetic on void pointer in multilevelProposeROIPlugin.cpp #1028
    • Update BERT plugin documentation.

    Removed

    • Removed extra terminate call in InstanceNorm
  • 21.03(Mar 10, 2021)

  • 21.02(Feb 5, 2021)

    Commit used by the 21.02 TensorRT NGC container

    Changelog

    Added

    Changed

    Removed

    • N/A
  • 20.12(Dec 19, 2020)

    Commit used by the 20.12 TensorRT NGC container

    Changelog

    Added

    • Add configurable input size for TLT MaskRCNN Plugin

    Changed

    • Update symbol export map for plugins
    • Correctly use channel dimension when creating Prelu node
    • Fix Jetson cross compilation CMakefile

    Removed

    • N/A
  • 20.11(Nov 20, 2020)

  • 20.10(Oct 22, 2020)

    Commit used by the 20.10 TensorRT NGC container

    Changelog

    Added

    • Polygraphy v0.20.13 - Deep Learning Inference Prototyping and Debugging Toolkit
    • PyTorch-Quantization Toolkit v2.0.0
    • Updated BERT plugins for variable sequence length inputs
      • Optimized kernels for sequence lengths of 64 and 96 added
    • Added Tacotron2 + Waveglow TTS demo #677
    • Re-enable GridAnchorRect_TRT plugin with rectangular feature maps #679
    • Update batchedNMS plugin to IPluginV2DynamicExt interface #738
    • Support 3D inputs in InstanceNormalization plugin #745
    • Added this CHANGELOG.md

    Changed

    • ONNX GraphSurgeon - v0.2.7 with bugfixes, new examples.
    • demo/BERT bugfixes for Jetson Xavier
    • Updated build Dockerfile to cuda-11.1
    • Updated ClangFormat style specification according to TensorRT coding guidelines

    Removed

    • N/A
  • 7.2.1(Oct 20, 2020)

    TensorRT OSS release corresponding to TensorRT 7.2.1.6 GA build.

    Changelog

    Added

    • Polygraphy v0.20.13 - Deep Learning Inference Prototyping and Debugging Toolkit
    • PyTorch-Quantization Toolkit v2.0.0
    • Updated BERT plugins for variable sequence length inputs
      • Optimized kernels for sequence lengths of 64 and 96 added
    • Added Tacotron2 + Waveglow TTS demo #677
    • Re-enable GridAnchorRect_TRT plugin with rectangular feature maps #679
    • Update batchedNMS plugin to IPluginV2DynamicExt interface #738
    • Support 3D inputs in InstanceNormalization plugin #745
    • Added this CHANGELOG.md

    Changed

    • ONNX GraphSurgeon - v0.2.7 with bugfixes, new examples.
    • demo/BERT bugfixes for Jetson Xavier
    • Updated build Dockerfile to cuda-11.1
    • Updated ClangFormat style specification according to TensorRT coding guidelines

    Removed

    • N/A
  • 20.09(Sep 28, 2020)
