Inference framework for MoE layers based on TensorRT with Python binding

Overview

InfMoE

Inference framework for MoE-based models, built around a TensorRT custom plugin named MoELayerPlugin (with a Python binding) that can run inference of MoE layers with arbitrary sub-layers on NVIDIA GPUs with minimal memory consumption.

InfMoE is open-sourced under the MIT License.

Installation

Dependencies:

  • CUDA (>=10.2)
  • cuDNN (>=8.0, corresponding to CUDA version)
  • TensorRT (>=7.0, corresponding to CUDA & cuDNN version)
  • zlib (to read npz files)
  • meson & ninja (building system)

Python (recommended)

To use TensorRT in Python, you first need to install the required Python packages: simply run python3 -m pip install -r requirements.txt.

Note: If you install nvidia-tensorrt from PyPI (rather than from a downloaded TensorRT package), you MUST ensure that the TensorRT version MoELayerPlugin links against matches the version used by the pip package (see site-packages/tensorrt/). Otherwise the plugin will not work correctly.
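
For instance, a minimal check of the pip package's version and install location, to compare against the TensorRT installation the plugin is built against:

import tensorrt
print(tensorrt.__version__)  # version provided by the nvidia-tensorrt pip package
print(tensorrt.__file__)     # lives under site-packages/tensorrt/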

Then build this plugin:

cd python

# if you have cuDNN & TensorRT installed in search path, or
python3 setup.py build_ext
# if you need to specify CUDA / cuDNN install location
# (CUDA can only be automatically searched by meson)
python3 setup.py build_ext --tensorrt-prefix=/path/to/tensorrt --cudnn-prefix=/path/to/cudnn

python3 setup.py install

You can also use bdist_wheel or other commands provided by setuptools. You can pass --debug to build_ext to enable verbose logging and keep debug symbols for debugging purposes.

C++ only (advanced)

cd plugin

# if you have cuDNN & TensorRT installed in search path
make builddir && make compile

# if you need to specify CUDA / cuDNN install location
# (CUDA can only be automatically searched by meson)
meson setup builddir -DWITH_TENSORRT=/path/to/tensorrt -DWITH_CUDNN=/path/to/cudnn
ninja -C builddir # or just run `make`

If everything goes well, you can find libtrtmoelayer.so in builddir. Similarly you can pass -DDEBUG=true to meson setup for debugging.

Plugin attributes

When initializing MoELayerPlugin in TensorRT (from either C++ or Python), the following attributes must be specified (a Python sketch of passing them follows the list):

  • expert_count: INT32, number of experts (sub-layers)
  • embedding_size: INT32, the input & output size of the expert network
  • hidden_size: INT32, the intermediate size of the feed-forward network (might not be used by the sub-layer)
  • max_concurrency: INT32, maximal number of experts kept concurrently in GPU memory (defaults to 2); setting it too large will lead to OOM
  • expert_centroids: FLOAT32 array, weights for dispatching tokens to experts; must have shape (d_model, expert_count), where d_model is the last dimension of the input tensor (a.k.a. embedding size)
  • expert_weight_file: null-terminated CHAR array, path to the expert weight file, read by the implementation of the sub-layer
  • expert_sublayer_type: null-terminated CHAR array, type of sub-layer used; currently only T5_FF can be used
  • moe_variant: null-terminated CHAR array, variant of the MoE layer, used to select different behaviours (can be cpm_2, base_layer or default)
  • layernorm_weight: FLOAT32 array, weight of the layer norm applied to the input before computing expert affiliation / score; must be provided when moe_variant is cpm_2
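
A minimal Python sketch of passing these attributes through TensorRT's plugin registry follows. The plugin creator name ("MoELayerPlugin"), its version string ("1"), the shared-library path and the sizes / file names are illustrative assumptions; consult python/examples for the exact values used by the shipped examples.

import ctypes
import numpy as np
import tensorrt as trt

# load the compiled plugin so it registers itself with TensorRT (path is an example)
ctypes.CDLL("libtrtmoelayer.so")

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
registry = trt.get_plugin_registry()
creator = registry.get_plugin_creator("MoELayerPlugin", "1")  # name / version are assumptions

d_model, expert_count = 1024, 4  # hypothetical sizes
centroids = np.zeros((d_model, expert_count), dtype=np.float32)  # use real centroids in practice

fields = trt.PluginFieldCollection([
    trt.PluginField("expert_count", np.array([expert_count], dtype=np.int32), trt.PluginFieldType.INT32),
    trt.PluginField("embedding_size", np.array([d_model], dtype=np.int32), trt.PluginFieldType.INT32),
    trt.PluginField("hidden_size", np.array([4 * d_model], dtype=np.int32), trt.PluginFieldType.INT32),
    trt.PluginField("max_concurrency", np.array([2], dtype=np.int32), trt.PluginFieldType.INT32),
    trt.PluginField("expert_centroids", centroids, trt.PluginFieldType.FLOAT32),
    trt.PluginField("expert_weight_file", np.frombuffer(b"experts.npz\0", dtype=np.int8), trt.PluginFieldType.CHAR),
    trt.PluginField("expert_sublayer_type", np.frombuffer(b"T5_FF\0", dtype=np.int8), trt.PluginFieldType.CHAR),
    trt.PluginField("moe_variant", np.frombuffer(b"default\0", dtype=np.int8), trt.PluginFieldType.CHAR),
    # layernorm_weight (FLOAT32 array) is additionally required when moe_variant is cpm_2
])
plugin = creator.create_plugin("moe_layer", fields)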

Usage

Currently InfMoE can only handle MoE layers with FP32 parameters, input & output. To run inference with a full network, you should slice it before and after any MoE layer:

  • For non-MoE layers, just save them in ONNX / UFF format and use TensorRT to parse them into a network (Python / C++), or construct the network manually with the TensorRT API (Python / C++).
  • For MoE layers, dump the expert centroids and the weights of each expert separately (in the format described below), then create a layer using MoELayerPlugin with Python or C++ (see examples).

Then you can concatenate the MoE and non-MoE layers to obtain the full network (or replace any specific 'placeholder' layer with a MoE layer), which can later be built into a TensorRT CUDA engine and used to run inference, or serialized and dumped to a file.
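
Continuing the sketch above, a rough outline of wrapping the plugin into a network and building an engine might look like this (the explicit-batch flag, input shape and builder calls are assumptions and depend on your TensorRT version):

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

inp = network.add_input("input", trt.float32, (1, 128, d_model))  # hypothetical (batch, seq_len, d_model)
moe = network.add_plugin_v2([inp], plugin)  # insert the MoE layer built above
network.mark_output(moe.get_output(0))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # deprecated in recent TensorRT releases
engine = builder.build_engine(network, config)

with open("moe.engine", "wb") as f:  # optionally serialize & dump the engine to a file
    f.write(engine.serialize())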

We provide several Python examples in python/examples showing how to do the aforementioned work. You can run them after installing this plugin. You are encouraged to read TensorRT documentation to understand its workflow prior to using this plugin.

Error handling

InfMoE requires that none of the following tensors contains NaN values:

  • layer input
  • expert centroids
  • weight of layer norm (if applicable)

It will also check the shape and data type of all parameters and input & output tensors. If any misconfiguration is found, it will print an error message to stderr and abort the whole process.
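
Since a NaN in these tensors aborts the whole process, a quick pre-check before building the engine can save time. A minimal sketch (the file names are hypothetical; adjust to however you store the weights):

import numpy as np

centroids = np.load("expert_centroids.npy")  # hypothetical dump of expert_centroids
layernorm_weight = np.load("layernorm_weight.npy")  # only needed for the cpm_2 variant

for name, tensor in [("expert_centroids", centroids), ("layernorm_weight", layernorm_weight)]:
    assert not np.isnan(tensor).any(), f"{name} contains NaN"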

Scheduling

See the CPM-2 paper for scheduling details; the scheduling implementation will be ported to the public source code soon.

Sub-layer

We provide some sub-layers in plugin/sublayers. To implement your own sub-layer, you need to:

  • Extend MoESubLayer class
  • Add your layer name and initialization code to MoELayerPlugin.h (in sublayer_type) and MoELayerPlugin.cc (in MoELayerPlugin::createSublayer())
  • Add your source file (.cpp only) to meson.build
  • Rebuild the plugin

T5FFLayer (T5_FF)

This project includes a sub-layer implementation of the feed-forward layer in the T5 network. It is defined as:

hs := hs + dense_relu_dense(layer_norm(hs))
layer_norm(hs) := wl * hs / sqrt(mean(pow(hs, 2)) + eps)
dense_relu_dense(hs) := (gelu(hs @ wi_0^T) * (hs @ wi_1^T)) @ wo^T

where wi_0, wi_1 and wo are linear layers with no bias; they first project the input tensor to 4 times its size in the last dimension and then back.
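
For reference, the formulas above translate to plain NumPy as follows (the GELU approximation and the eps value are assumptions and may differ slightly from the plugin's kernels). Such a reference can be handy for checking the plugin's output on small random inputs.

import numpy as np

def gelu(x):
    # tanh approximation of GELU; the plugin's exact formulation may differ
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(hs, wl, eps=1e-6):
    # wl * hs / sqrt(mean(hs^2) + eps), reduced over the last dimension
    return wl * hs / np.sqrt(np.mean(hs ** 2, axis=-1, keepdims=True) + eps)

def dense_relu_dense(hs, wi_0, wi_1, wo):
    return (gelu(hs @ wi_0.T) * (hs @ wi_1.T)) @ wo.T

def t5_ff(hs, wl, wi_0, wi_1, wo):
    return hs + dense_relu_dense(layer_norm(hs, wl), wi_0, wi_1, wo)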

The given expert_weight_file must be an npz file containing the following variables (with n ranging from 0 to expert_count - 1): n/layer_norm_weight, n/wi_0_weight, n/wi_1_weight, n/wo_weight.
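
For example, such a file could be produced with numpy.savez. The random weights and sizes below are purely illustrative (with d_ff = 4 * d_model as described above); confirm the exact storage layout against python/examples.

import numpy as np

expert_count, d_model = 4, 1024  # hypothetical sizes
d_ff = 4 * d_model
weights = {}
for n in range(expert_count):
    weights[f"{n}/layer_norm_weight"] = np.random.randn(d_model).astype(np.float32)
    weights[f"{n}/wi_0_weight"] = np.random.randn(d_ff, d_model).astype(np.float32)
    weights[f"{n}/wi_1_weight"] = np.random.randn(d_ff, d_model).astype(np.float32)
    weights[f"{n}/wo_weight"] = np.random.randn(d_model, d_ff).astype(np.float32)
np.savez("experts.npz", **weights)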

IdentityLayer (Identity)

This layer DOES NOTHING (and thus uses none of the provided plugin attributes); it simply copies the input directly to the output. It is intended for debugging purposes only.
