TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.

Overview

License Documentation

TensorRT Open Source Software

This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. Included are the sources for TensorRT plugins and parsers (Caffe and ONNX), as well as sample applications demonstrating usage and capabilities of the TensorRT platform. These open source software components are a subset of the TensorRT General Availability (GA) release with some extensions and bug-fixes.

Build

Prerequisites

To build the TensorRT-OSS components, you will first need the following software packages.

TensorRT GA build

System Packages

Optional Packages

Downloading TensorRT Build

  1. Download TensorRT OSS

    git clone -b master https://github.com/nvidia/TensorRT TensorRT
    cd TensorRT
    git submodule update --init --recursive
  2. (Optional - if not using TensorRT container) Specify the TensorRT GA release build

    If using the TensorRT OSS build container, TensorRT libraries are preinstalled under /usr/lib/x86_64-linux-gnu and you may skip this step.

    Else download and extract the TensorRT GA build from NVIDIA Developer Zone.

    Example: Ubuntu 18.04 on x86-64 with cuda-11.4

    cd ~/Downloads
    tar -xvzf TensorRT-8.2.1.8.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
    export TRT_LIBPATH=`pwd`/TensorRT-8.2.1.8

    Example: Windows on x86-64 with cuda-11.4

    cd ~\Downloads
    Expand-Archive .\TensorRT-8.2.1.8.Windows10.x86_64.cuda-11.4.cudnn8.2.zip
    $Env:TRT_LIBPATH = "$(Get-Location)\TensorRT-8.2.1.8"
    $Env:PATH += ';C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\MSBuild\15.0\Bin\'
  3. (Optional - for Jetson builds only) Download the JetPack SDK

    1. Download and launch the JetPack SDK manager. Login with your NVIDIA developer account.
    2. Select the platform and target OS (example: Jetson AGX Xavier, Linux Jetpack 4.6), and click Continue.
    3. Under Download & Install Options change the download folder and select Download now, Install later. Agree to the license terms and click Continue.
    4. Move the extracted files into the docker/jetpack_files folder of the TensorRT OSS checkout.

Setting Up The Build Environment

For Linux platforms, we recommend that you generate a docker container for building TensorRT OSS as described below. For native builds, on Windows for example, please install the prerequisite System Packages.

  1. Generate the TensorRT-OSS build container.

    The TensorRT-OSS build container can be generated using the supplied Dockerfiles and build script. The build container is configured for building TensorRT OSS out-of-the-box.

    Example: Ubuntu 18.04 on x86-64 with cuda-11.4.2 (default)

    ./docker/build.sh --file docker/ubuntu-18.04.Dockerfile --tag tensorrt-ubuntu18.04-cuda11.4

    Example: CentOS/RedHat 7 on x86-64 with cuda-10.2

    ./docker/build.sh --file docker/centos-7.Dockerfile --tag tensorrt-centos7-cuda10.2 --cuda 10.2

    Example: Ubuntu 18.04 cross-compile for Jetson (aarch64) with cuda-10.2 (JetPack SDK)

    ./docker/build.sh --file docker/ubuntu-cross-aarch64.Dockerfile --tag tensorrt-jetpack-cuda10.2 --cuda 10.2

    Example: Ubuntu 20.04 on aarch64 with cuda-11.4.2

    ./docker/build.sh --file docker/ubuntu-20.04-aarch64.Dockerfile --tag tensorrt-aarch64-ubuntu20.04-cuda11.4
  2. Launch the TensorRT-OSS build container.

    Example: Ubuntu 18.04 build container

    ./docker/launch.sh --tag tensorrt-ubuntu18.04-cuda11.4 --gpus all

    NOTE:

    1. Use the --tag corresponding to the build container generated in Step 1.
    2. The NVIDIA Container Toolkit is required for GPU access (running TensorRT applications) inside the build container.
    3. The sudo password for the Ubuntu build containers is 'nvidia'.
    4. Specify a port number using --jupyter to launch Jupyter notebooks.

Building TensorRT-OSS

  • Generate Makefiles or VS project (Windows) and build.

    Example: Linux (x86-64) build with default cuda-11.4.2

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out
     make -j$(nproc)

    NOTE: On CentOS7, the default g++ version does not support C++14. For native builds (not using the CentOS7 build container), first install devtoolset-8 to obtain the updated g++ toolchain as follows:

    yum -y install centos-release-scl
    yum-config-manager --enable rhel-server-rhscl-7-rpms
    yum -y install devtoolset-8
    export PATH="/opt/rh/devtoolset-8/root/bin:${PATH}"

    Example: Linux (aarch64) build with default cuda-11.4.2

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
     make -j$(nproc)

    Example: Native build on Jetson (aarch64) with cuda-10.2

    cd $TRT_OSSPATH
    mkdir -p build && cd build
    cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DTRT_PLATFORM_ID=aarch64 -DCUDA_VERSION=10.2
    CC=/usr/bin/gcc make -j$(nproc)

    NOTE: C compiler must be explicitly specified via CC= for native aarch64 builds of protobuf.

    Example: Ubuntu 18.04 Cross-Compile for Jetson (aarch64) with cuda-10.2 (JetPack)

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=10.2 -DCUDNN_LIB=/pdk_files/cudnn/usr/lib/aarch64-linux-gnu/libcudnn.so -DCUBLAS_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublas.so -DCUBLASLT_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublasLt.so
     make -j$(nproc)

    NOTE: The latest JetPack SDK v4.6 only supports TensorRT 8.0.1.

    Example: Windows (x86-64) build in Powershell

     cd $Env:TRT_OSSPATH
     mkdir -p build ; cd build
     cmake .. -DTRT_LIB_DIR=$Env:TRT_LIBPATH -DTRT_OUT_DIR='$(Get-Location)\out' -DCMAKE_TOOLCHAIN_FILE=..\cmake\toolchains\cmake_x64_win.toolchain
     msbuild ALL_BUILD.vcxproj

    NOTE:

    1. The default CUDA version used by CMake is 11.4.2. To override this, for example to 10.2, append -DCUDA_VERSION=10.2 to the cmake command.
    2. If samples fail to link on CentOS7, create this symbolic link: ln -s $TRT_OUT_DIR/libnvinfer_plugin.so $TRT_OUT_DIR/libnvinfer_plugin.so.8
  • Required CMake build arguments are:

    • TRT_LIB_DIR: Path to the TensorRT installation directory containing libraries.
    • TRT_OUT_DIR: Output directory where generated build artifacts will be copied.
  • Optional CMake build arguments:

    • CMAKE_BUILD_TYPE: Specify whether the generated binaries are release or debug (contain debug symbols) builds. Values consist of [Release] | Debug
    • CUDA_VERSION: The version of CUDA to target, for example [11.4.2].
    • CUDNN_VERSION: The version of cuDNN to target, for example [8.2].
    • PROTOBUF_VERSION: The version of Protobuf to use, for example [3.0.0]. Note: Changing this will not configure CMake to use a system version of Protobuf, it will configure CMake to download and try building that version.
    • CMAKE_TOOLCHAIN_FILE: The path to a toolchain file for cross compilation.
    • BUILD_PARSERS: Specify if the parsers should be built, for example [ON] | OFF. If turned OFF, CMake will try to find precompiled versions of the parser libraries to use in compiling samples. First in ${TRT_LIB_DIR}, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available.
    • BUILD_PLUGINS: Specify if the plugins should be built, for example [ON] | OFF. If turned OFF, CMake will try to find a precompiled version of the plugin library to use in compiling samples. First in ${TRT_LIB_DIR}, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available.
    • BUILD_SAMPLES: Specify if the samples should be built, for example [ON] | OFF.
    • GPU_ARCHS: GPU (SM) architectures to target. By default we generate CUDA code for all major SMs. Specific SM versions can be specified here as a quoted space-separated list to reduce compilation time and binary size. A table of compute capabilities of NVIDIA GPUs can be found here. Examples:
      • NVIDIA A100: -DGPU_ARCHS="80"
      • Tesla T4, GeForce RTX 2080: -DGPU_ARCHS="75"
      • Titan V, Tesla V100: -DGPU_ARCHS="70"
      • Multiple SMs: -DGPU_ARCHS="80 75"
    • TRT_PLATFORM_ID: Bare-metal builds (unlike containerized cross-compilation) on non-Linux/x86 platforms must explicitly specify the target platform. Currently supported options: x86_64 (default), aarch64

References

TensorRT Resources

Known Issues

Issues
  • How to use NMS with Pytorch model (that was converted to ONNX -> TensorRT)

    All right, so, I have a PyTorch SSD detector with MobileNet. Since I failed to convert the model with NMS in it (to be more precise, I converted it, but the TRT engine is built in a wrong way from that .onnx file), I decided to leave the NMS part to TRT.

    In general, there are several ways to add NMS in TRT:

    1. Use graphsurgeon with TensorFlow model and add NMS as graphsurgeon.create_plugin_node
    2. Use CPP code for plugin (https://github.com/NVIDIA/TensorRT/tree/master/plugin/batchedNMSPlugin)
    3. Use DeepStream that has NMS plugin

    But, I have a PyTorch model that I converted to onnx and then to TRT without any CPP code (Python only). My question is very simple: how can I combine my current pipeline with the CPP plugin for NMS?
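
    One way to bridge a Python-only ONNX workflow with the C++ batchedNMSPlugin (a sketch, not an answer from this thread) is to register the TensorRT plugin library from Python before parsing, so that an ONNX node whose op matches a registered plugin name (such as BatchedNMS_TRT) is resolved by the ONNX parser. The model file name below is a placeholder:

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    # Register libnvinfer_plugins (including BatchedNMS_TRT) with the plugin
    # registry so the ONNX parser can resolve plugin ops by name.
    trt.init_libnvinfer_plugins(TRT_LOGGER, "")

    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # "model_with_nms.onnx" is a placeholder for an ONNX file that already
    # contains a BatchedNMS_TRT node (e.g. inserted with onnx-graphsurgeon).
    with open("model_with_nms.onnx", "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))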

    Component: Plugins good-reference triaged 
    opened by ivanpanshin 83
  • [08/18/2019-20:58:10] [E] [TRT] UffParser: Validator error: mrcnn_mask_deconv/add_1: Unsupported operation _AddV2 &&&& FAILED TensorRT.sample_maskrcnn # ./sample_uff_maskRCNN -d

    [08/18/2019-20:58:10] [E] [TRT] UffParser: Validator error: mrcnn_mask_deconv/add_1: Unsupported operation _AddV2
    &&&& FAILED TensorRT.sample_maskrcnn # ./sample_uff_maskRCNN -d ~/data
    

    Ubuntu 16.04, TensorRT 6.x (built from source from the git branch release/6.0). Following the tutorial, I converted the matterport Mask R-CNN model successfully to UFF, but inference gave the result above.
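
    A workaround commonly used for this validator error (a sketch, not taken from the issue itself) is to rewrite TF2-style AddV2 nodes back to Add in the frozen graph before running the UFF converter; the graph paths are placeholders:

    import tensorflow as tf

    # Rewrite AddV2 ops (unknown to the UFF validator) back to Add before
    # converting the frozen graph to UFF. Paths are placeholders.
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile("mrcnn_frozen.pb", "rb") as f:
        graph_def.ParseFromString(f.read())

    for node in graph_def.node:
        if node.op == "AddV2":
            node.op = "Add"

    with tf.io.gfile.GFile("mrcnn_frozen_patched.pb", "wb") as f:
        f.write(graph_def.SerializeToString())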

    Component: UFF unsupported-op Samples 
    opened by jinfagang 52
  • TensorRT 7 load ONNX Resize ops error

    Description

    When I load the ONNX model, the FPN F.interpolate op gives an error:


    While parsing node number 209 [Resize]: ERROR: builtin_op_importers.cpp:2412 In function importResize: [8] Assertion failed: scales.is_weights() && "Resize scales must be an initializer!"


    This error is raised in onnx-tensorrt.
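
    A common workaround for this assertion (a sketch, not from the issue) is to fold the subgraph that computes the Resize scales into constant initializers with onnx-graphsurgeon before handing the model to TensorRT; the model path is a placeholder:

    import onnx
    import onnx_graphsurgeon as gs

    # Fold the nodes computing the Resize scales into initializers so the
    # TensorRT ONNX parser sees them as weights. "model.onnx" is a placeholder.
    graph = gs.import_onnx(onnx.load("model.onnx"))
    graph.fold_constants().cleanup().toposort()
    onnx.save(gs.export_onnx(graph), "model_folded.onnx")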

    Environment

    TensorRT Version: 7.0
    GPU Type: 1060
    Nvidia Driver Version: 441.22
    CUDA Version: 10.2
    CUDNN Version: 7.6.5.32
    Operating System + Version: win10

    OS: Windows Component: ONNX Framework: PyTorch good-reference Release: 7.x triaged 
    opened by syshensyshen 51
  • how to create an engine serve for multiple source inputs?

    How can I create one engine (e.g. one TensorRT detector engine) that can serve 6 or 10 cameras for object detection using threading, without the outputs from these sources getting confused?
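
    The pattern usually suggested for this (a sketch with placeholder names, not an answer from the thread) is to share one deserialized engine but give every camera thread its own execution context, CUDA stream and buffers:

    import threading
    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    # Share one engine; give each camera thread its own execution context so
    # concurrent inferences do not overwrite each other's state.
    # "detector.engine" and the per-frame details are placeholders.
    runtime = trt.Runtime(TRT_LOGGER)
    with open("detector.engine", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    def camera_worker(camera_id):
        context = engine.create_execution_context()  # one context per thread
        # Allocate per-thread device buffers and a CUDA stream here, then call
        # context.execute_async_v2(bindings, stream_handle) for each frame.

    threads = [threading.Thread(target=camera_worker, args=(i,)) for i in range(6)]
    for t in threads:
        t.start()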

    question ask-the-experts triaged 
    opened by HoangTienDuc 45
  • How to add NMS with Tensorflow Model (that was converted to ONNX)

    I have taken an SSDLite MobileNet v2 model from the TensorFlow model zoo.

    Steps:

    1. Generated the ONNX model using the tf2onnx lib:

       python -m tf2onnx.convert --graphdef mv2/ssdlite_mobilenet_v2_coco_2018_05_09/frozen_inference_graph.pb \
           --output MODEL_frozen.onnx \
           --fold_const --opset 11 \
           --inputs image_tensor:0 \
           --outputs num_detections:0,detection_boxes:0,detection_scores:0,detection_classes:0

    2. Added the NMS layers in the ONNX model based on references from this issue:

    import onnx_graphsurgeon as gs
    import onnx
    import numpy as np
    
    input_model_path = "MODEL_frozen.onnx"
    output_model_path = "model_gs.onnx"
    
    @gs.Graph.register()
    def trt_batched_nms(self, boxes_input, scores_input, nms_output,
                        share_location, num_classes):
    
        boxes_input.outputs.clear()
        scores_input.outputs.clear()
        nms_output.inputs.clear()
    
        attrs = {
            "shareLocation": share_location,
            "numClasses": num_classes,
            "backgroundLabelId": 0,
            "topK": 116740,
            "keepTopK": 100,
            "scoreThreshold": 0.3,
            "iouThreshold": 0.6,
            "isNormalized": True,
            "clipBoxes": True
        }
        return self.layer(op="BatchedNMS_TRT", attrs=attrs,
                          inputs=[boxes_input, scores_input],
                          outputs=[nms_output])
    
    
    graph = gs.import_onnx(onnx.load(input_model_path))
    graph.inputs[0].shape=[1,300,300,3]
    print(graph.inputs[0].shape)
    
    for inp in graph.inputs:
        inp.dtype = np.int
    
    input = graph.inputs[0]
    
    tmap = graph.tensors()
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores/NonMaxSuppressionV5__1761:0"],
                          tmap["NonMaxSuppression__1763:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_1/NonMaxSuppressionV5__1737:0"],
                          tmap["NonMaxSuppression__1739:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1713:0"],
                          tmap["NonMaxSuppression__1715:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_3/NonMaxSuppressionV5__1689:0"],
                          tmap["NonMaxSuppression__1691:0"],
                          share_location=False,
                          num_classes=8)
    
    
    # Remove unused nodes, and topologically sort the graph.
    # graph.cleanup()
    # graph.toposort()
    # graph.fold_constants().cleanup()
    
    # Export the ONNX graph from graphsurgeon
    onnx.checker.check_model(gs.export_onnx(graph))
    onnx.save_model(gs.export_onnx(graph), output_model_path)
    
    print("Saving the ONNX model to {}".format(output_model_path))
    
    

    I am not able to figure out which nodes in the ONNX graph I should use in place of "Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0" and the others:

    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
          
    tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores/NonMaxSuppressionV5__1761:0"],
                          tmap["NonMaxSuppression__1763:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_1/NonMaxSuppressionV5__1737:0"],
                          tmap["NonMaxSuppression__1739:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1713:0"],
                          tmap["NonMaxSuppression__1715:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_3/NonMaxSuppressionV5__1689:0"],
                          tmap["NonMaxSuppression__1691:0"],
                          share_location=False,
                          num_classes=8)
    

    MODEL_frozen.onnx.zip

    I have also attached the ONNX file. Any suggestions on how to find it?
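
    One way to find the right tensors without guessing names (a sketch, using the attached model's file name) is to walk the graph and take the boxes/scores inputs of each ONNX NonMaxSuppression node directly:

    import onnx
    import onnx_graphsurgeon as gs

    # Print the boxes/scores tensors feeding each NonMaxSuppression node so the
    # trt_batched_nms calls can reference them by name.
    graph = gs.import_onnx(onnx.load("MODEL_frozen.onnx"))
    for node in graph.nodes:
        if node.op == "NonMaxSuppression":
            boxes, scores = node.inputs[0], node.inputs[1]
            print(node.name, "boxes:", boxes.name, "scores:", scores.name,
                  "output:", node.outputs[0].name)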

    Component: ONNX Topic: ONNX Plugin triaged 
    opened by letdivedeep 43
  • BERT fp16 accuracy problem

    Description

    When using TRT to build an FP16 model, the inference accuracy differs too much from FP32. The model is BERT base. Why?

    Environment

    TensorRT Version: 7.2.1
    NVIDIA GPU: T4
    NVIDIA Driver Version: 440.59
    CUDA Version: 10.2
    CUDNN Version: 8.0.4
    Operating System: centos7
    Python Version (if applicable): 3.6
    Tensorflow Version (if applicable): 1.15.4
    PyTorch Version (if applicable):
    Baremetal or Container (if so, version):

    Steps To Reproduce

    Proceed as follows:

    1. tf (frozen graph) -> onnx (version: 1.8.1) -> trt engine
    2. When building the TRT engine, set these parameters:

       with builder.create_builder_config() as config:
           config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
           ...

    3. At the same time, I also tried to set the precision on these layers (such as LayerNorm/moments/SquaredDifference, intermediate/dense/Erf, pooler/dense/Tanh, query_head_contrastive/Relu and so on):

       network.get_layer(i).precision = trt.DataType.FLOAT

       BUT no effect

    I also found something very strange: when I compared layer0 and layer1, the accuracy is not much different, but at layer2 there is a big difference. This model has 12 layers, and each layer has the same structure.
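
    For what it's worth, setting layer.precision is only a hint unless the builder is also told to obey precision constraints. Below is a minimal sketch of pinning selected layers to FP32 on TensorRT 7.x (the flag name differs on newer releases, and the keyword list is an assumption, not from the issue):

    import tensorrt as trt

    def pin_layers_to_fp32(network, config, keywords=("LayerNorm", "Erf", "Tanh")):
        # Request FP16 overall, but force layers whose names match a keyword to
        # run in FP32. STRICT_TYPES (TensorRT 7.x) makes the builder honor the
        # per-layer precision instead of treating it as a hint.
        config.set_flag(trt.BuilderFlag.FP16)
        config.set_flag(trt.BuilderFlag.STRICT_TYPES)
        for i in range(network.num_layers):
            layer = network.get_layer(i)
            if any(k in layer.name for k in keywords):
                layer.precision = trt.DataType.FLOAT
                for j in range(layer.num_outputs):
                    layer.set_output_type(j, trt.DataType.FLOAT)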

    Precision: FP16 Release: 7.x triaged 
    opened by chenzhanyiczy 42
  • Onnx Dynamic input to TensorRT

    [TensorRT] INTERNAL ERROR: Assertion failed: aMatrix.second == bMatrix.first ../rtExt/cuda/cudaMatrixMultiplyRunner.cpp:35
    Aborting...
    [TensorRT] ERROR: ../rtExt/cuda/cudaMatrixMultiplyRunner.cpp (35) - Assertion Error in assertDimsOkayForMatrixMultiplyLayer: 0 (aMatrix.second == bMatrix.first)

    Topic: Dynamic Shape triaged Runtime: Error 
    opened by BarryKCL 41
  • ONNX networks can't use INT8 calibration and batching

    Description

    This is due to mutually incompatible changes in the TRT7 release:

    https://docs.nvidia.com/deeplearning/sdk/tensorrt-release-notes/tensorrt-7.html

    ONNX parser with dynamic shapes support
    The ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set.

    versus

    Known Issues
    The INT8 calibration does not work with dynamic shapes. To work around this issue, ensure there are two passes in the code: using a fixed-shape input to build the engine in the first pass allows TensorRT to generate the calibration cache.

    This means the ONNX network must be exported at a fixed batch size in order to get INT8 calibration working, but now it's no longer possible to specify the batch size. I also verified that manually fixing up the inputs with setDimensions(...-1...) does not work, you will hit an assertion mg.nodes[mg.regionIndices[outputRegion]].size ==mg.nodes[mg.regionIndices[inputRegion]].size while building.

    One would think there might be sort of a workaround by exporting two different networks, one with a fixed batch size and a second one with a dynamic_axis, and then using the calibration from one for the other. ~~However, even here there are severe pitfalls: a calibration cache that is generated for, say, batch_size=1 won't necessarily work for larger batch sizes, presumably because they will generate a different convolution strategy that causes different accuracy issues.~~ Edit: This might've been another issue.

    Lastly, the calibrator itself appears to be using implicit batch sizes, and breaks on batch size > 1 as follows:

    TRT: Starting Calibration with batch size 16. Calibrated 16 images.
    TRT: Explicit batch network detected and batch size specified, use execute without batch size instead.
    TRT: C:\source\builder\cudnnCalibrator.cpp (707) - Cuda Error in nvinfer1::builder::Histogram::add: 700 (an illegal memory access was encountered)
    TRT: FAILED_ALLOCATION: Unknown exception
    TRT: C:\source\builder\cudnnCalibrator.cpp (703) - Cuda Error in nvinfer1::builder::Histogram::add: 700 (an illegal memory access was encountered)
    TRT: FAILED_ALLOCATION: Unknown exception
    TRT: C:\source\rtSafe\cuda\caskConvolutionRunner.cpp (233) - Cuda Error in nvinfer1::rt::task::CaskConvolutionRunner::allocateContextResources: 700 (an illegal memory access was encountered)
    TRT: FAILED_EXECUTION: Unknown exception
    TRT: Calibrated batch 0 in 2.62865 seconds.
    Cuda failure: 700

    with batch_size == 1, it's also hitting assertions:

    TRT: Explicit batch network detected and batch size specified, use execute without batch size instead.
    TRT: Assertion failed: d.nbDims >= 1 C:\source\rtSafe\safeHelpers.cpp:419
    Aborting...

    The combination of all these failures means that you can't really use ONNX networks in INT8 mode, at least the "Using a fixed shape input to build the engine in the first pass" recommendation hits all kinds of internal assertions as you can see above.
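
    For reference, later TensorRT releases (7.1 and up) let you attach a fixed-shape calibration profile to an explicit-batch network, which sidesteps part of this. A sketch with placeholder names and shapes, not the workflow from this report:

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    def build_int8_engine(onnx_path, calibrator, input_name, shape):
        # Parse an explicit-batch ONNX network and calibrate it with a fixed
        # shape via set_calibration_profile (TensorRT >= 7.1).
        builder = trt.Builder(TRT_LOGGER)
        network = builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        parser = trt.OnnxParser(network, TRT_LOGGER)
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                raise RuntimeError(parser.get_error(0))

        config = builder.create_builder_config()
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = calibrator

        profile = builder.create_optimization_profile()
        profile.set_shape(input_name, shape, shape, shape)  # min = opt = max
        config.add_optimization_profile(profile)
        config.set_calibration_profile(profile)
        return builder.build_engine(network, config)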

    Environment

    TensorRT Version: 7.0.0.11
    GPU Type: RTX 2080
    Nvidia Driver Version: 441.22
    CUDA Version: 10.2
    CUDNN Version: 7.6.0.5
    Operating System + Version: Windows 10
    Python Version (if applicable): 3.6
    TensorFlow Version (if applicable):
    PyTorch Version (if applicable): 1.3 stable
    Baremetal or Container (if container which image + tag): bare

    Relevant Files

    Steps To Reproduce

    bug Component: ONNX Precision: INT8 Release: 7.x 
    opened by gcp 38
  • trt sampleUffMaskrcnn has a different result with maskrcnn implemented in keras

    Hi, I have used sampleUffMaskRCNN for my own dataset; it worked, but the results are different. The result of trt sampleUffMaskRCNN depends much on anchor scales and anchor ratios; I set the same params in both test codes. The Keras one performs better, as it can show more objects (instances), but some objects in trt maskrcnn can't be detected, especially slender objects, like a pole. Thanks for help.

    opened by hwh-hit 37
  • [REFERENCE] KeyError: 'mrcnn_mask_bn4/batchnorm/mul_1' in running sampleUffMaskRCNN demo

    While trying to run the Mask R-CNN demo following this page:

    Ubuntu 16.04.6
    CUDA 10.1.168
    tensorrt 5.1.5.0
    uff 0.6.3

    Traceback (most recent call last):
      File "mrcnn_to_trt_single.py", line 165, in <module>
        main()
      File "mrcnn_to_trt_single.py", line 123, in main
        text=True, list_nodes=list_nodes)
      File "mrcnn_to_trt_single.py", line 158, in convert_model
        debug_mode = False
      File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/conversion_helpers.py", line 233, in from_tensorflow_frozen_model
        return from_tensorflow(graphdef, output_nodes, preprocessor, **kwargs)
      File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/conversion_helpers.py", line 108, in from_tensorflow
        pre.preprocess(dynamic_graph)
      File "./config.py", line 123, in preprocess
        connect(dynamic_graph, timedistributed_connect_pairs)
      File "./config.py", line 113, in connect
        if node_a_name not in dynamic_graph.node_map[node_b_name].input:
    KeyError: 'mrcnn_mask_bn4/batchnorm/mul_1'
    
    Component: UFF Samples Release: 6.x Framework: TensorFlow Conversion: UFF good-reference 
    opened by seanyuner 36
  • (Upsample) How can I use onnx parser with opset 11 ?

    Description

    onnx-parser is basically built with ir_version 3, opset 7 (https://github.com/onnx/onnx-tensorrt/blob/master/onnx_trt_backend.cpp)

    Is there any way to use onnx parser with opset 11 support ?

    I mean, the parser works only with opset 7. The parser works well if I use an ir4_opset7 ONNX model, but doesn't work if I use an ir4_opset11 ONNX model.

    It also cannot parse opset 8 and 9.

    My onnx models are made by pytorch 1.4.0a.

    Can I rebuild the parser by changing only the BACKEND_OPSET constant inside onnx_trt_backend.cpp?
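
    As a side note (not from the thread), a quick way to check which IR version and opsets an exported model actually declares; the model path is a placeholder:

    import onnx

    # Inspect the IR version and operator-set imports of an exported model.
    model = onnx.load("model.onnx")
    print("ir_version:", model.ir_version)
    for imp in model.opset_import:
        print("domain:", imp.domain or "ai.onnx", "opset:", imp.version)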

    Environment

    TensorRT Version: 7.0.0
    GPU Type: T4
    Nvidia Driver Version: 440.33.01
    CUDA Version: 10.2.89
    CUDNN Version: 7.6.5
    Operating System + Version: Ubuntu18.04
    Python Version (if applicable): 3.6.9
    TensorFlow Version (if applicable): 1.4.0
    PyTorch Version (if applicable): 1.4.0a

    API: Python Framework: PyTorch Conversion: torch.onnx Release: 7.x TODO 
    opened by dhkim0225 34
  • Question: trtexec hardware during compilation vs production?

    I am using TensorRT together with Triton. In order to avoid a long-running start-up time when we deploy to edge nodes, I would like to do the TensorRT conversion before we deploy (Triton has support for doing it from different model formats) using the CLI: trtexec. Do I need to do the trtexec step on exactly the same hardware (GPU, CPU)?

    Thank you!

    triaged 
    opened by NikeNano 1
  • Question about trtexec log

    I use trtexec to convert my ONNX model to an INT8 engine. But I find a log line like 'Dynamic range would not be set for tensor (Unnamed Layer* 36) [Constant]_output for nbSpatialDims = 2. TensorRT does not support Int8 precision for tensors of index type or with dims < (nbSpatialDims + 2), expect fall back to non-int8 implementation for any layer consuming or producing given tensor if the layer assumes nbSpatialDims = 2.' I want to know whether it will affect the convolutional layers of the model running with INT8 kernels (although a similar problem to the one in the log above does not occur in convolutional layers), or whether there is any doc introducing the trtexec log.

    triaged 
    opened by pycoco 1
  • mAP drops a lot when Infer a INT8 quantized ONNX model.

    Description

    Hi, I have a quantized YOLOv5s ONNX model. When I use ONNX Runtime to infer with this model, I get an mAP of 36.8; but when I use the C++ TRT backend with INT8 inference enabled, the mAP drops 10.9. I'm not sure what the problem is; could you please give some advice and check the model (attachment)? Thanks!

    Environment

    TensorRT Version: 8.4.1.5
    NVIDIA GPU: Tesla P40
    NVIDIA Driver Version: 510.47.03
    CUDA Version: 11.2
    CUDNN Version: 8.1.1
    Operating System: Linux
    Python Version (if applicable): 3.7
    Tensorflow Version (if applicable):
    PyTorch Version (if applicable):
    Baremetal or Container (if so, version):

    Relevant Files

    Model link: https://bj.bcebos.com/v1/paddle-slim-models/act/yolov5s_quant.onnx
    File: yolov5s_quant.onnx.zip

    Steps To Reproduce

    csrcs/fastdeploy/backends/tensorrt/trt_backend.cc(91)::CheckDynamicShapeConfig The loaded model's input tensor:x2paddle_images has shape [1, 3, 640, 640].
    [08/10/2022-09:56:39] [I] [TRT] [MemUsageChange] Init CUDA: CPU +196, GPU +0, now: CPU 239, GPU 1274 (MiB)
    [08/10/2022-09:56:40] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +6, GPU +2, now: CPU 262, GPU 1276 (MiB)
    [08/10/2022-09:56:40] [W] [TRT] onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
    [INFO] csrcs/fastdeploy/backends/tensorrt/trt_backend.cc(430)::CreateTrtEngine Start to building TensorRT Engine...
    [08/10/2022-09:56:55] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +184, GPU +76, now: CPU 506, GPU 1352 (MiB)
    [08/10/2022-09:56:55] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +128, GPU +80, now: CPU 634, GPU 1432 (MiB)
    [08/10/2022-09:56:55] [W] [TRT] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.1.1
    [08/10/2022-09:56:55] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
    [08/10/2022-09:58:20] [I] [TRT] Detected 1 inputs and 4 output network tensors.
    [08/10/2022-09:58:20] [I] [TRT] Total Host Persistent Memory: 145120
    [08/10/2022-09:58:20] [I] [TRT] Total Device Persistent Memory: 2082816
    [08/10/2022-09:58:20] [I] [TRT] Total Scratch Memory: 0
    [08/10/2022-09:58:20] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 34 MiB, GPU 213 MiB
    [08/10/2022-09:58:20] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 87.029ms to assign 11 blocks to 181 nodes requiring 24294912 bytes.
    [08/10/2022-09:58:20] [I] [TRT] Total Activation Memory: 24294912
    [08/10/2022-09:58:20] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +7, GPU +9, now: CPU 7, GPU 9 (MiB)
    [08/10/2022-09:58:20] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
    [08/10/2022-09:58:21] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
    [08/10/2022-09:58:21] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 616, GPU 1414 (MiB)
    [08/10/2022-09:58:21] [I] [TRT] Loaded engine size: 8 MiB
    [08/10/2022-09:58:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +9, now: CPU 0, GPU 9 (MiB)
    [INFO] csrcs/fastdeploy/backends/tensorrt/trt_backend.cc(496)::CreateTrtEngine TensorRT Engine is built succussfully.
    [08/10/2022-09:58:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +25, now: CPU 0, GPU 34 (MiB)
    loading annotations into memory...
    Done (t=0.62s)
    creating index...
    index created!
    2022-08-10 09:58:21 [INFO] Starting to read file list from dataset...
    2022-08-10 09:58:22 [INFO] ...

    Then comes the log recording the mAP. The mAP is very low compared to the result from the ONNX backend.

    Topic: QAT triaged 
    opened by yunyaoXYY 1
  • after tensorrt init finished, my own stream invalid

    I have TensorRT 8.2.3.0, nvJPEG 11.6.2.8, and a Tesla T4 GPU with driver 510.47.03, on Ubuntu x86_64. Here's my situation: I made a dynamic library which encapsulates nvJPEG decoder operations. It contains mainly 2 functions: one is init(), which creates the stream, buffers and so on; the other is decode(), which uses the handle initialized by init() and does the decode work. I also have a main program in Python which uses TensorRT to do the prediction work, and which dlopens the previous dynamic library with the DEEPBIND option.

    Here is the problem: if I call the dynamic library function init() first (which contains cudaStreamCreateWithFlags) and then init TensorRT, then cudaEventRecord returns 400 "invalid resource handle", which means an invalid stream.

    When I instead call init() after the TensorRT init, everything goes well.

    So what is the real problem? Does TensorRT invalidate all of the process's streams?

    triaged 
    opened by neiblegy 1
  • How should I speed up T5 original exported saved_model by using TRT?

    My env:

    Docker image: nvcr.io/nvidia/tensorflow:22.05-tf2-py3
    TRT: 8.2.5.1
    CUDA: 11.7
    tf: 2.8
    GPU: Tesla V100

    The original saved_model took 300 ms when batch_size=32 and sen_length=128, which is too long for deployment. So I wanted to speed up T5 using TF-TRT. But when I convert the saved_model using the code below, TF-TRT doesn't work:

    from tensorflow.python.compiler.tensorrt import trt_convert as trt
    import numpy as np
    import tensorflow_text
    import tensorflow as tf
    
    tf.compat.v1.disable_v2_behavior()
    
    input_saved_model_dir = 'exported_model/batch32_length128_0810/1660123651'
    output_saved_model_dir = 'trt_saved_model/batch32_length128_0810/1/'
    converter = trt.TrtGraphConverter(
        input_saved_model_dir=input_saved_model_dir,
        max_workspace_size_bytes=(11<32),
        max_batch_size=32,
        minimum_segment_size=50,
        precision_mode='FP32',
        is_dynamic_op=True,
        maximum_cached_engines=1)
    
    converter.convert()
    converter.save(output_saved_model_dir)
    
    

    Before using the code, you should add some code in tensorflow/python/compiler/tensorrt/trt_convert.py; the reference is here. Could somebody help me with this?
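
    In case it helps, on a TF2 container the V2 converter is what is normally used with TF2 SavedModels. A sketch reusing the same paths as above (the parameters are assumptions, not from the issue):

    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    # TF2-style conversion; the TF1-style TrtGraphConverter above expects a
    # v1 graph. Precision and segment size mirror the snippet above.
    params = trt.TrtConversionParams(
        precision_mode=trt.TrtPrecisionMode.FP32,
        minimum_segment_size=50)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir='exported_model/batch32_length128_0810/1660123651',
        conversion_params=params)
    converter.convert()
    converter.save('trt_saved_model/batch32_length128_0810/1/')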

    triaged 
    opened by chenl-keep 3
  • How to build arm Architecture TensorRT OSS v6.0.1

    Description

    I want to build TensorRT OSS v6.0.1 for the ARM architecture.

    I used the minimum versions of CUDA and cuDNN that support ARM:

    cuda_11.0.2_450.51.05_linux_sbsa.run
    cudnn-11.2-linux-aarch64sbsa-v8.1.0.77.tgz
    

    But according to the README file, I need to download the TensorRT GA v6.0.1 build, which does not support the ARM architecture.

    Environment

    TensorRT Version: v6.0.1
    NVIDIA GPU: NVIDIA A100
    NVIDIA Driver Version: 470.63.01
    CUDA Version: 11.0
    CUDNN Version: 8.1.0

    Can I install a higher version of the TensorRT GA build that supports the ARM architecture? How should I handle this situation?

    triaged 
    opened by jamdodot 3
Releases (22.07)
  • 22.07(Jul 22, 2022)

    Commit used by the 22.07 TensorRT NGC container.

    Changelog

    Added

    • polygraphy-trtexec-plugin tool for Polygraphy
    • Multi-profile support for demoBERT
    • KV cache support for HF BART demo

    Changed

    • Updated ONNX-GS to v0.3.20

    Removed

    • None
    Source code(tar.gz)
    Source code(zip)
  • 8.4.1(Jun 14, 2022)

    TensorRT OSS release corresponding to TensorRT 8.4.1.5 GA release.

    Key Features and Updates:

    • Samples enhancements

    • EfficientDet sample

      • Added support for EfficientDet Lite and AdvProp models.
      • Added dynamic batch support.
      • Added mixed precision engine builder.
    • HuggingFace transformer demo

      • Added BART model.
      • Performance speedup of GPT-2 greedy search using GPU implementation.
      • Fixed GPT2 onnx export failure due to 2G file size limitation.
      • Extended Megatron LayerNorm plugins to support larger hidden sizes.
      • Added performance benchmarking mode.
      • Enable tf32 format by default.
    • demoBERT enhancements

      • Add --duration flag to perf benchmarking script.
      • Fixed import of nvinfer_plugins library in demoBERT on Windows.
    • Torch-QAT toolkit

      • quant_bert.py module removed. It is now upstreamed to HuggingFace QDQBERT.
      • Use axis0 as default for deconv.
      • #1939 - Fixed path in classification_flow example.
    • Plugin enhancements

    • Build containers

      • Updated default cuda versions to 11.6.2.
      • CentOS Linux 8 has reached End-of-Life on Dec 31, 2021. The corresponding container has been removed from TensorRT-OSS.
      • Install devtoolset-8 for updated g++ versions in CentOS7 container.
    • Tooling enhancements

    • trtexec enhancements

      • Added --layerPrecisions and --layerOutputTypes flags for specifying layer-wise precision and output type constraints.
      • Added --memPoolSize flag to specify the size of workspace as well as the DLA memory pools via a unified interface. Correspondingly the --workspace flag has been deprecated.
      • "End-To-End Host Latency" metric has been removed. Use the “Host Latency” metric instead. For more information, refer to Benchmarking Network section in the TensorRT Developer Guide.
      • Use enqueueV2() instead of enqueue() when engine has explicit batch dimensions.
    Source code(tar.gz)
    Source code(zip)
  • 22.06(Jun 9, 2022)

    Commit used by the 22.06 TensorRT NGC container.

    Changelog

    Added

    • None

    Changed

    • Disentangled attention (DMHA) plugin refactored
    • ONNX parser updated to 8.2GA

    Removed

    • None
    Source code(tar.gz)
    Source code(zip)
  • 22.05(May 13, 2022)

    Commit used by the 22.05 TensorRT NGC container.

    Changelog

    Added

    • Disentangled attention plugin for DeBERTa
    • DMHA (multiscaleDeformableAttnPlugin) plugin for DDETR
    • Performance benchmarking mode to HuggingFace demo

    Changed

    • Updated base TensorRT version to 8.2.5.1
    • Updated onnx-graphsurgeon v0.3.19 CHANGELOG
    • fp16 support for pillarScatterPlugin
    • #1939 - Fixed path in quantization classification_flow
    • Fixed GPT2 onnx export failure due to 2G limitation
    • Use axis0 as default for deconv in pytorch-quantization toolkit
    • Updated onnx export script for CoordConvAC sample
    • Install devtoolset-8 for updated g++ version in CentOS7 container

    Removed

    • Usage of deprecated TensorRT APIs in samples removed
    • quant_bert.py module removed from pytorch-quantization
    Source code(tar.gz)
    Source code(zip)
  • 22.04(Apr 14, 2022)

    Commit used by the 22.04 TensorRT NGC container.

    Changelog

    Added

    • TensorRT Engine Explorer v0.1.0 README
    • Detectron 2 Mask R-CNN R50-FPN python sample
    • Model export script for sampleOnnxMnistCoordConvAC

    Changed

    • Updated base TensorRT version to 8.2.4.2
    • Updated copyright headers with SPDX identifiers
    • Updated onnx-graphsurgeon v0.3.17 CHANGELOG
    • PyramidROIAlign plugin refactor and bug fixes
    • Fixed MultilevelCropAndResize crashes on Windows
    • #1583 - sublicense ieee/half.h under Apache2
    • Updated demo/BERT performance tables for rel-8.2
    • #1774 Fix python hangs at IndexErrors when TF is imported after TensorRT
    • Various bugfixes in demos - BERT, Tacotron2 and HuggingFace GPT/T5 notebooks
    • Cleaned up sample READMEs

    Removed

    • sampleNMT removed from samples
    Source code(tar.gz)
    Source code(zip)
  • 22.03(Mar 24, 2022)

    Commit used by the 22.03 TensorRT NGC container.

    Changelog

    Added

    • EfficientDet sample enhancements
      • Added support for EfficientDet Lite and AdvProp models.
      • Added dynamic batch support.
      • Added mixed precision engine builder.

    Changed

    • Better decoupling of HuggingFace demo tests
    Source code(tar.gz)
    Source code(zip)
  • 22.02(Feb 4, 2022)

    Commit used by the 22.02 TensorRT NGC container.

    Changelog

    Added

    Changed

    • Extend Megatron LayerNorm plugins to support larger hidden sizes
    • Refactored EfficientNMS plugin for TFTRT and added implicit batch mode support
    • Update base TensorRT version to 8.2.3.0
    • GPT-2 greedy search speedup - now runs on GPU
    • Updates to TensorRT developer tools
    • Updated ONNX parser to v8.2.3.0
    • Minor updates and bugfixes
      • Samples: TFOD, GPT-2, demo/BERT
      • Plugins: proposalPlugin, geluPlugin, bertQKVToContextPlugin, batchedNMS

    Removed

    • Unused source file(s) in demo/BERT
    Source code(tar.gz)
    Source code(zip)
  • 22.01(Jan 24, 2022)

  • 8.2.1(Nov 24, 2021)

    TensorRT OSS release corresponding to TensorRT 8.2.1.8 GA release.

    • Updates since TensorRT 8.2.0 EA release.

    • Please refer to the TensorRT 8.2.1 GA release notes for more information.

    • ONNX parser v8.2.1

      • Removed duplicate constant layer checks that caused some performance regressions
      • Fixed expand dynamic shape calculations
      • Added parser-side checks for Scatter layer support
    • Sample updates

      • Added Tensorflow Object Detection API converter samples, including Single Shot Detector, Faster R-CNN and Mask R-CNN models
      • Multiple enhancements in HuggingFace transformer demos
        • Added multi-batch support
        • Fixed resultant performance regression in batchsize=1
        • Fixed T5 large/T5-3B accuracy issues
        • Added notebooks for T5 and GPT-2
        • Added CPU benchmarking option
      • Deprecated kSTRICT_TYPES (strict type constraints). Equivalent behaviour now achieved by setting PREFER_PRECISION_CONSTRAINTS, DIRECT_IO, and REJECT_EMPTY_ALGORITHMS
      • Removed sampleMovieLens
      • Renamed sampleReformatFreeIO to sampleIOFormats
      • Add idleTime option for samples to control qps
      • Specify default value for precisionConstraints
      • Fixed reporting of TensorRT build version in trtexec
      • Fixed combineDescriptions typo in trtexec/tracer.py
      • Fixed usages of kDIRECT_IO
    • Plugin updates

      • EfficientNMS plugin support extended to TF-TRT, and for clang builds.
      • Sanitize header definitions for BERT fused MHA plugin
      • Separate C++ and cu files in splitPlugin to avoid PTX generation (required for CUDA enhanced compatibility support)
      • Enable C++14 build for plugins
    • ONNX tooling updates

    • Build and container fixes

      • Add SM86 target to default GPU_ARCHS for platforms with cuda-11.1+
      • Remove deprecated SM_35 and add SM_60 to default GPU_ARCHS
      • Skip CUB builds for cuda 11.0+ #1455
      • Fixed cuda-10.2 container build failures in Ubuntu 20.04
      • Add native ARM server build container
      • Install devtoolset-8 for updated g++ version in CentOS7
      • Added a note on supporting c++14 builds for CentOS7
      • Fixed docker build for large UIDs #1373
      • Updated README instructions for Jetpack builds
    • demo enhancements

      • Updated Tacotron2 instructions and add CPU benchmarking
      • Fixed issues in demoBERT python notebook
    • Documentation updates

      • Updated Python documentation for add_reduce, add_top_k, and ISoftMaxLayer
      • Renamed default GitHub branch to main and updated hyperlinks
    Source code(tar.gz)
    Source code(zip)
  • 21.10(Oct 5, 2021)

    Commit used by the 21.10 TensorRT NGC container.

    Changelog

    Added

    • Benchmark script for demoBERT-Megatron
    • Dynamic Input Shape support for EfficientNMS plugin
    • Support empty dimensions in ONNX
    • INT32 and dynamic clips through elementwise in ONNX parser

    Changed

    • Bump TensorRT version to 8.0.3.4
    • Use static shape for only single batch single sequence input in demo/BERT
    • Revert to using native FC layer in demo/BERT and FCPlugin only on older GPUs.
    • Update demo/Tacotron2 for TensorRT 8.0
    • Updates to TensorRT developer tools
      • Polygraphy v0.33.0
        • Added various examples, a CLI User Guide and how-to guides.
        • Added experimental support for DLA.
        • Added a data to-input tool that can combine inputs/outputs created by --save-inputs/--save-outputs.
        • Added a PluginRefRunner which provides CPU reference implementations for TensorRT plugins
        • Made several performance improvements in the Polygraphy CUDA wrapper.
        • Removed the to-json tool which was used to convert Pickled data generated by Polygraphy 0.26.1 and older to JSON.
      • Bugfixes and documentation updates in pytorch-quantization toolkit.
    • Bumped up package versions: tensorflow-gpu 2.5.1, pillow 8.3.2
    • ONNX parser enhancements and bugfixes
      • Update ONNX submodule to v1.8.0
      • Update convDeconvMultiInput function to properly handle deconvs
      • Update RNN documentation
      • Update QDQ axis assertion
      • Fix bidirectional activation alpha and beta values
      • Fix opset10 Resize
      • Fix shape tensor unsqueeze
      • Mark BOOL tiles as unsupported
      • Remove unnecessary shape tensor checks

    Removed

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • 8.2.0-EA(Oct 5, 2021)

    TensorRT OSS release corresponding to TensorRT 8.2.0.6 EA release.

    Added

    • Demo applications showcasing TensorRT inference of HuggingFace Transformers.
      • Support is currently extended to GPT-2 and T5 models.
    • Added support for the following ONNX operators:
      • Einsum
      • IsNan
      • GatherND
      • Scatter
      • ScatterElements
      • ScatterND
      • Sign
      • Round
    • Added support for building TensorRT Python API on Windows.

    Updated

    • Notable API updates in TensorRT 8.2.0.6 EA release. See TensorRT Developer Guide for details.
      • Added three new APIs, IExecutionContext: getEnqueueEmitsProfile(), setEnqueueEmitsProfile(), and reportToProfiler() which can be used to collect layer profiling info when the inference is launched as a CUDA graph.
      • Eliminated the global logger; each Runtime, Builder or Refitter now has its own logger.
      • Added new operators: IAssertionLayer, IConditionLayer, IEinsumLayer, IIfConditionalBoundaryLayer, IIfConditionalOutputLayer, IIfConditionalInputLayer, and IScatterLayer.
      • Added new IGatherLayer modes: kELEMENT and kND
      • Added new ISliceLayer modes: kFILL, kCLAMP, and kREFLECT
      • Added new IUnaryLayer operators: kSIGN and kROUND
      • Added new runtime class IEngineInspector that can be used to inspect the detailed information of an engine, including the layer parameters, the chosen tactics, the precision used, etc.
      • ProfilingVerbosity enums have been updated to show their functionality more explicitly.
    • Updated TensorRT OSS container defaults to cuda 11.4
    • CMake to target C++14 builds.
    • Updated following ONNX operators:
      • Gather and GatherElements implementations to natively support negative indices
      • Pad layer to support ND padding, along with edge and reflect padding mode support
      • If layer with general performance improvements.

    Removed

    • Removed sampleMLP.
    • Several flags of trtexec have been deprecated:
      • --explicitBatch flag has been deprecated and has no effect. When the input model is in UFF or in Caffe prototxt format, the implicit batch dimension mode is used automatically; when the input model is in ONNX format, the explicit batch mode is used automatically.
      • --explicitPrecision flag has been deprecated and has no effect. When the input ONNX model contains Quantization/Dequantization nodes, TensorRT automatically uses explicit precision mode.
      • --nvtxMode=[verbose|default|none] has been deprecated in favor of --profilingVerbosity=[detailed|layer_names_only|none] to show its functionality more explicitly.

    Signed-off-by: Rajeev Rao [email protected]

    Source code(tar.gz)
    Source code(zip)
  • 21.09(Sep 22, 2021)

    Commit used by the 21.09 TensorRT NGC container.

    Changelog

    Added

    • Add ONNX2TRT_VERSION overwrite in CMake.

    Changed

    • Updates to TensorRT developer tools
    • Fix assertion in EfficientNMSPlugin

    Removed

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • 21.08(Aug 5, 2021)

    Commit used by the 21.08 TensorRT NGC container.

    Changelog

    Added

    Changed

    • Updated samples and plugins directory structure
    • Updates to TensorRT developer tools
    • README fix to update build command for native aarch64 builds.

    Removed

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • 21.07(Jul 21, 2021)

  • 8.0.1(Jul 2, 2021)

    TensorRT OSS release corresponding to TensorRT 8.0.1.6 GA release.

    Added

    • Added support for the following ONNX operators: Celu, CumSum, EyeLike, GatherElements, GlobalLpPool, GreaterOrEqual, LessOrEqual, LpNormalization, LpPool, ReverseSequence, and SoftmaxCrossEntropyLoss.
    • Rehauled Resize ONNX operator, now fully supporting the following modes:
      • Coordinate Transformation modes: half_pixel, pytorch_half_pixel, tf_half_pixel_for_nn, asymmetric, and align_corners.
      • Modes: nearest, linear.
      • Nearest Modes: floor, ceil, round_prefer_floor, round_prefer_ceil.
    • Added support for multi-input ONNX ConvTranpose operator.
    • Added support for 3D spatial dimensions in ONNX InstanceNormalization.
    • Added support for generic 2D padding in ONNX.
    • ONNX QuantizeLinear and DequantizeLinear operators leverage IQuantizeLayer and IDequantizeLayer.
      • Added support for tensor scales.
      • Added support for per-axis quantization.
    • Added EfficientNMS_TRT, EfficientNMS_ONNX_TRT plugins and experimental support for ONNX NonMaxSuppression operator.
    • Added ScatterND plugin.
    • Added TensorRT QuickStart Guide.
    • Added new samples: engine_refit_onnx_bidaf builds an engine from ONNX BiDAF model and refits engine with new weights, efficientdet and efficientnet samples for demonstrating Object Detection using TensorRT.
    • Added support for Ubuntu20.04 and RedHat/CentOS 8.3.
    • Added Python 3.9 support.

    Changed

    • Update Polygraphy to v0.30.3.
    • Update ONNX-GraphSurgeon to v0.3.10.
    • Update Pytorch Quantization toolkit to v2.1.0.
    • Notable TensorRT API updates
      • TensorRT now declares API’s with the noexcept keyword. All TensorRT classes that an application inherits from (such as IPluginV2) must guarantee that methods called by TensorRT do not throw uncaught exceptions, or the behavior is undefined.
      • Destructors for classes with destroy() methods were previously protected. They are now public, enabling use of smart pointers for these classes. The destroy() methods are deprecated.
    • Moved RefitMap API from ONNX parser to core TensorRT.
    • Various bugfixes for plugins, samples and ONNX parser.
    • Port demoBERT to tensorflow2 and update UFF samples to leverage nvidia-tensorflow1 container.

    Removed

    • IPlugin and IPluginFactory interfaces were deprecated in TensorRT 6.0 and have been removed in TensorRT 8.0. We recommend that you write new plugins or refactor existing ones to target the IPluginV2DynamicExt and IPluginV2IOExt interfaces. For more information, refer to Migrating Plugins From TensorRT 6.x Or 7.x To TensorRT 8.x.x.
      • For plugins based on IPluginV2DynamicExt and IPluginV2IOExt, certain methods with legacy function signatures (derived from IPluginV2 and IPluginV2Ext base classes) which were deprecated and marked for removal in TensorRT 8.0 will no longer be available.
    • Removed samplePlugin since it showcased IPluginExt interface, which is no longer supported in TensorRT 8.0.
    • Removed sampleMovieLens and sampleMovieLensMPS.
    • Removed Dockerfile for Ubuntu 16.04. TensorRT 8.0 debians for Ubuntu 16.04 require python 3.5 while minimum required python version for TensorRT OSS is 3.6.
    • Removed support for PowerPC builds, consistent with TensorRT GA releases.

    Notes

    • We had deprecated the Caffe Parser and UFF Parser in TensorRT 7.0. They are still tested and functional in TensorRT 8.0, however, we plan to remove the support in a future release. Ensure you migrate your workflow to use tf2onnx, keras2onnx or TensorFlow-TensorRT (TF-TRT).

    Signed-off-by: Rajeev Rao [email protected]

    Source code(tar.gz)
    Source code(zip)
  • 21.06(Jun 23, 2021)

    Commit used by the 21.06 TensorRT NGC container

    Changelog

    Added

    • Add switch for batch-agnostic mode in NMS plugin
    • Add missing model.py in uff_custom_plugin sample

    Changed

    • Update to Polygraphy v0.29.2
    • Update to ONNX-GraphSurgeon v0.3.9
    • Fix numerical errors for float type in NMS/batchedNMS plugins
    • Update demoBERT input dimensions to match Triton requirement #1051
    • Optimize TLT MaskRCNN plugins:
      • enable fp16 precision in multilevelCropAndResizePlugin and multilevelProposeROIPlugin
      • Algorithms optimization for NMS kernels and ROIAlign kernel
      • Fix invalid cuda config issue when bs is larger than 32
      • Fix issues found on Jetson NANO

    Removed

    • Removed fcplugin from demoBERT to improve inference latency on GA100/Turing
    Source code(tar.gz)
    Source code(zip)
  • 21.05(May 19, 2021)

    Commit used by the 21.05 TensorRT NGC container

    Changelog

    Added

    • Extended support for ONNX operator InstanceNormalization to 5D tensors
    • Support negative indices in ONNX Gather operator
    • Add support for importing ONNX double-typed weights as float
    • ONNX-GraphSurgeon (v0.3.7) support for models with externally stored weights

    Changed

    • Update ONNX-TensorRT to 21.05
    • Relicense ONNX-TensorRT under Apache2
    • demoBERT builder fixes for multi-batch
    • Speedup demoBERT build using global timing cache and disable cuDNN tactics
    • Standardize python package versions across OSS samples
    • Bugfixes in multilevelProposeROI and bertQKV plugin
    • Fix memleaks in samples logger
    Source code(tar.gz)
    Source code(zip)
  • 21.04(Apr 12, 2021)

    Commit used by the 21.04 TensorRT NGC container

    Changelog

    Added

    • SM86 kernels for BERT MHA plugin
    • Added opset13 support for SoftMax, LogSoftmax, Squeeze, and Unsqueeze.
    • Added support for the EyeLike and GatherElements operators.

    Changed

    • Updated TensorRT version to v7.2.3.4.
    • Update to ONNX-TensorRT 21.03
    • ONNX-GraphSurgeon (v0.3.4) - updates fold_constants to correctly exit early.
    • Set default CUDA_INSTALL_DIR #798
    • Plugin bugfixes, qkv kernels for sm86
    • Fixed GroupNorm CMakeFile for cu sources #1083
    • Permit groupadd with non-unique GID in build containers #1091
    • Avoid reinterpret_cast #146
    • Clang-format plugins and samples
    • Avoid arithmetic on void pointer in multilevelProposeROIPlugin.cpp #1028
    • Update BERT plugin documentation.

    Removed

    • Removes extra terminate call in InstanceNorm
    Source code(tar.gz)
    Source code(zip)
  • 21.03(Mar 10, 2021)

  • 21.02(Feb 5, 2021)

    Commit used by the 21.02 TensorRT NGC container

    Changelog

    Added

    Changed

    Removed

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • 20.12(Dec 19, 2020)

    Commit used by the 20.12 TensorRT NGC container

    Changelog

    Added

    • Add configurable input size for TLT MaskRCNN Plugin

    Changed

    • Update symbol export map for plugins
    • Correctly use channel dimension when creating Prelu node
    • Fix Jetson cross compilation CMakefile

    Removed

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • 20.11(Nov 20, 2020)

  • 20.10(Oct 22, 2020)

    Commit used by the 20.10 TensorRT NGC container

    Changelog

    Added

    • Polygraphy v0.20.13 - Deep Learning Inference Prototyping and Debugging Toolkit
    • PyTorch-Quantization Toolkit v2.0.0
    • Updated BERT plugins for variable sequence length inputs
      • Optimized kernels for sequence lengths of 64 and 96 added
    • Added Tacotron2 + Waveglow TTS demo #677
    • Re-enable GridAnchorRect_TRT plugin with rectangular feature maps #679
    • Update batchedNMS plugin to IPluginV2DynamicExt interface #738
    • Support 3D inputs in InstanceNormalization plugin #745
    • Added this CHANGELOG.md

    Changed

    • ONNX GraphSurgeon - v0.2.7 with bugfixes, new examples.
    • demo/BERT bugfixes for Jetson Xavier
    • Updated build Dockerfile to cuda-11.1
    • Updated ClangFormat style specification according to TensorRT coding guidelines

    Removed

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • 7.2.1(Oct 20, 2020)

    TensorRT OSS release corresponding to TensorRT 7.2.1.6 GA build.

    Changelog

    Added

    • Polygraphy v0.20.13 - Deep Learning Inference Prototyping and Debugging Toolkit
    • PyTorch-Quantization Toolkit v2.0.0
    • Updated BERT plugins for variable sequence length inputs
      • Optimized kernels for sequence lengths of 64 and 96 added
    • Added Tacotron2 + Waveglow TTS demo #677
    • Re-enable GridAnchorRect_TRT plugin with rectangular feature maps #679
    • Update batchedNMS plugin to IPluginV2DynamicExt interface #738
    • Support 3D inputs in InstanceNormalization plugin #745
    • Added this CHANGELOG.md

    Changed

    • ONNX GraphSurgeon - v0.2.7 with bugfixes, new examples.
    • demo/BERT bugfixes for Jetson Xavier
    • Updated build Dockerfile to cuda-11.1
    • Updated ClangFormat style specification according to TensorRT coding guidelines

    Removed

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • 20.09(Sep 28, 2020)

  • 20.07(Jul 21, 2020)

  • 20.06(Jul 14, 2020)

  • 7.1.3(Jul 1, 2020)

  • 20.04(May 5, 2020)

  • 20.03(Mar 30, 2020)
