PPLNN

Overview

PPLNN, which is short for "PPLNN is a Primitive Library for Neural Network", is a high-performance deep-learning inference engine for efficient AI inferencing. It can run various ONNX models and provides better support for OpenMMLab models.


Documents

Contact Us

Contributions

This project uses the Contributor Covenant as its code of conduct. Contributions are highly appreciated.

Acknowledgements

License

This project is distributed under the Apache License, Version 2.0.

Issues
  • cuda convolution kernel input question.

Hi,

    I see that the currently implemented CUDA conv kernels are either fp16 or int8, and their data layout is NHWC, as required by NVIDIA's tensor cores. So in ./tools/pplnn.py, where does the layout transpose happen? On the CPU side? From the nvprof results, I only see the conv kernel.

    If I want to do the transpose on the GPU side, how should I change the command? Or do I need to add an additional transpose node in the ONNX file?

    opened by leiwen83 12
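For context on the layout question above, here is a minimal NumPy sketch of the NCHW ↔ NHWC conversion that either the engine or an explicit ONNX Transpose node would have to perform; the tensor contents and shapes are illustrative only, not taken from the issue.

```python
import numpy as np

# A dummy activation tensor in NCHW layout (batch, channels, height, width),
# the layout ONNX models normally use.
nchw = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)

# NHWC layout (batch, height, width, channels), as required by tensor cores.
nhwc = np.transpose(nchw, (0, 2, 3, 1))

print(nchw.shape)  # (2, 3, 4, 4)
print(nhwc.shape)  # (2, 4, 4, 3)

# The inverse permutation recovers the original layout exactly.
back = np.transpose(nhwc, (0, 3, 1, 2))
assert np.array_equal(back, nchw)
```

An ONNX `Transpose` node with `perm = [0, 2, 3, 1]` expresses the same permutation inside the graph, which is one way to move the conversion onto the GPU.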
  • 【gemm_fp32_fma performance】For common shapes, gemm_fp32_fma performs roughly on par with TensorFlow 1.15's Eigen matmul?

    Question: when benchmarking SGEMM, I found that OpenPPL's gemm_fp32_fma performs roughly on par with TensorFlow 1.15's Eigen matmul for some common shapes. Is this expected? Is there any way to improve it? Setup: OpenPPL v0.8, a 32-core Intel machine, multi-threading enabled for both. Build command: ./build.sh -DPPLNN_USE_X86_64=ON -DPPLNN_ENABLE_ONNX_MODEL=OFF -DPPL_USE_X86_OMP=ON -DPPLNN_USE_OPENMP=ON. Test data below: [screenshot]

    opened by huangmiumang 9
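When comparing GEMM backends like this, it helps to report achieved GFLOP/s rather than raw wall time, so results are comparable across shapes. A minimal sketch of that methodology, using NumPy as a stand-in backend (the shape and timing loop are illustrative, not the poster's actual benchmark):

```python
import time
import numpy as np

def gemm_gflops(m, n, k, repeats=10):
    """Time C = A @ B and return achieved GFLOP/s (2*m*n*k FLOPs per GEMM)."""
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    a @ b  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = time.perf_counter() - start
    return (2.0 * m * n * k * repeats) / elapsed / 1e9

# Measure one "common shape"; compare the same number across backends
# (e.g. gemm_fp32_fma vs. Eigen) on the same machine and thread count.
print(f"{gemm_gflops(256, 256, 256):.1f} GFLOP/s")
```

If both implementations land near the machine's peak FLOP/s for a given shape, near-identical results are expected rather than surprising.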
  • [CUDA] `RuntimeBuilder.Preprocess()` causes subsequent CUDA function calls to fail

What are the problems? (screenshots or detailed error messages)

    I observe that, for some models (e.g. YOLOX-s and DBNet-r18; others such as ResNet-18 are fine), after creating a runtime with RuntimeBuilder, subsequent CUDA function calls (or kernel launches) may fail.

    I first hit the CUDA "invalid argument" error when testing ppl.nn with mmdeploy's test.py, at a point after runtime creation and before inference, while copying data from host to device. Later I met the same problem when testing with mmdeploy's SDK.

    After digging around for a while, I found that the simplest way to reproduce the problem is with pplnn.py: insert the following code

    import torch
    t = torch.Tensor([[1,1],[1,1]]).cuda()
    

to https://github.com/openppl-public/ppl.nn/blob/1ae5d95f3ee49b3e582564cc004443931fbe2f7a/tools/pplnn.py#L564 and then run

    python pplnn.py --use-cuda --onnx-model model.onnx --in-shape 1_3_640_640 --quick-select
    

    got

    INFO: PPLNN version: [0.8.0], commit: [02418bb57bef2d888b57d44589a599080cb806d9]
    [INFO][2022-07-06 22:23:06.057][utils.cc:456] total partition(s) of graph[torch-jit-export]: 1.
    [INFO][2022-07-06 22:23:06.067][opt_graph.cc:324] added 1020 new bridge kernels
    [INFO][2022-07-06 22:23:06.223][opt_graph.cc:581] deleted 990 bridge kernels
    Traceback (most recent call last):
      File "pplnn.py", line 567, in <module>
        t = torch.Tensor([[1,1],[1,1]]).cuda()
    RuntimeError: CUDA error: invalid argument
    

    Which version(commit id or tag) of ppl.nn is used?

    02418bb57bef2d888b57d44589a599080cb806d9

    What's the operating system ppl.nn runs on?

    Ubuntu 18.04

    What's the compiler and its version?

    GCC-7.5, CUDA-11.1

    What are the commands used to build ppl.nn?

    cmake .. \
        -DCMAKE_INSTALL_PREFIX=/workspace/ppl.nn/install \
        -DPPLNN_ENABLE_PYTHON_API=ON \
        -DPPLNN_USE_X86_64=ON \
        -DPPLNN_USE_CUDA=ON \
        -DPPL_USE_X86_AVX512=OFF \
        -DPPLNN_ENABLE_CUDA_JIT=OFF \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_CUDA_ARCHITECTURES=75
    
    opened by lzhangzz 9
  • Mask R-CNN failed with pplnn

The model was converted from the mmdetection library. When I try to execute it with pplnn, it shows this error:

    [INFO][2021-07-14 17:18:19.999][pplnn.cc:703] ppl.nn version: 5d56662bf5a288898f0dd5b90f763459cc86f47a
    [WARNING][2021-07-14 17:18:21.873][engine.cc:209] Default input dims for dynamic graph are 1_3_224_224, we recommend using '--dims' to set a suitable training shape.
    [INFO][2021-07-14 17:18:21.873][pplnn.cc:104] ***** register CudaEngine *****
    [INFO][2021-07-14 17:18:22.320][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export]: 1.
    [ERROR][2021-07-14 17:18:22.322][reshape_reshape.cc:66] infer shape failed.
    [ERROR][2021-07-14 17:18:22.338][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.339][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.340][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.341][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.341][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.342][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.342][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.342][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.342][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.343][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.343][reshape_unsqueeze.cc:36] axes overflow.
    [ERROR][2021-07-14 17:18:22.343][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.344][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.344][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.345][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.345][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.345][reshape_split.cc:59] splited axis and sum of split point not match.
    [INFO][2021-07-14 17:18:22.346][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export1]: 1.
    [INFO][2021-07-14 17:18:22.346][opt_graph.cc:204] Create 2 TensorImpl
    [INFO][2021-07-14 17:18:22.346][opt_graph.cc:316] added 2 new bridge kernels
    [INFO][2021-07-14 17:18:22.346][opt_graph.cc:478] deleted 1 bridge kernels
    [INFO][2021-07-14 17:18:22.347][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export2]: 1.
    [INFO][2021-07-14 17:18:22.347][opt_graph.cc:204] Create 20 TensorImpl
    [INFO][2021-07-14 17:18:22.347][opt_graph.cc:316] added 21 new bridge kernels
    [INFO][2021-07-14 17:18:22.347][opt_graph.cc:478] deleted 14 bridge kernels
    [ERROR][2021-07-14 17:18:22.348][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.348][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.348][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.349][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.349][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.349][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.350][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.350][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.350][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.389][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.389][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.390][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.390][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.390][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.391][reshape_add.cc:39] unbroadcastable input.
    [ERROR][2021-07-14 17:18:22.391][reshape_unsqueeze.cc:36] axes overflow.
    [ERROR][2021-07-14 17:18:22.391][reshape_unsqueeze.cc:36] axes overflow.
    [INFO][2021-07-14 17:18:22.392][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export3]: 1.
    [INFO][2021-07-14 17:18:22.392][opt_graph.cc:204] Create 2 TensorImpl
    [INFO][2021-07-14 17:18:22.392][opt_graph.cc:316] added 2 new bridge kernels
    [INFO][2021-07-14 17:18:22.392][opt_graph.cc:478] deleted 1 bridge kernels
    [INFO][2021-07-14 17:18:22.392][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export4]: 1.
    [INFO][2021-07-14 17:18:22.393][opt_graph.cc:204] Create 20 TensorImpl
    [INFO][2021-07-14 17:18:22.393][opt_graph.cc:316] added 21 new bridge kernels
    [INFO][2021-07-14 17:18:22.408][opt_graph.cc:478] deleted 14 bridge kernels
    [ERROR][2021-07-14 17:18:22.408][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.409][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.409][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.409][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.410][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.410][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.410][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.411][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.411][reshape_split.cc:59] splited axis and sum of split point not match.
    [ERROR][2021-07-14 17:18:22.413][reshape_split.cc:59] splited axis and sum of split point not match.
    [INFO][2021-07-14 17:18:22.426][simple_graph_partitioner.cc:107] total partition(s) of graph[torch-jit-export5]: 1.
    [ERROR][2021-07-14 17:18:22.426][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.427][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.427][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.427][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.427][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.428][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.428][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.428][reshape_concat.cc:42] input shape not match.
    [ERROR][2021-07-14 17:18:22.429][reshape_concat.cc:42] input shape not match.
    [INFO][2021-07-14 17:18:22.429][opt_graph.cc:204] Create 135 TensorImpl
    [INFO][2021-07-14 17:18:22.430][opt_graph.cc:316] added 174 new bridge kernels
    [INFO][2021-07-14 17:18:22.433][opt_graph.cc:478] deleted 153 bridge kernels
    [INFO][2021-07-14 17:18:22.434][opt_graph.cc:204] Create 2263 TensorImpl
    [INFO][2021-07-14 17:18:22.660][opt_graph.cc:316] added 2626 new bridge kernels
    [INFO][2021-07-14 17:20:05.963][opt_graph.cc:478] deleted 2547 bridge kernels
    [ERROR][2021-07-14 17:20:06.007][scheduler_common.cc:170] exec kernel[Pad_146] failed: invalid value
    [ERROR][2021-07-14 17:20:06.007][sequential_scheduler.cc:116] execute kernel[Pad_146] failed: invalid value
    [ERROR][2021-07-14 17:20:06.007][pplnn.cc:804] Run() failed: invalid value
    

I'm running it with real image data. Does pplnn support Mask R-CNN, or what should I do to execute it successfully? Thanks a lot! The model was generated by this command:

    python ../tools/deployment/pytorch2onnx.py ../configs/mask_rcnn/mask_rcnn_r50_fpn_mstrain-poly_3x_coco.py \
    mask_rcnn_r50_fpn_mstrain-poly_3x_coco_20210524_201154-21b550bb.pth \
    --output-file mask_rcnn.onnx --simplify --dynamic-export
    
    opened by Maosquerade 9
  • tools/pplnn.py --use-cuda output error

    What are the problems? (screenshots or detailed error messages)

    I used ./tools/pplnn.py --use-cuda --onnx-model tests/testdata/conv.onnx to test the Python API with the CUDA engine, and added prints of the input and output data values at https://github.com/openppl-public/ppl.nn/blob/master/tools/pplnn.py#L499 and https://github.com/openppl-public/ppl.nn/blob/master/tools/pplnn.py#L511. It appears the input tensor and output tensor have the same values, which differs from the x86 engine's output:

    INFO: PPLNN version: [0.6.3], commit: [9444a9d2ee0b89d8cd4a2fee8cef839fedfe8837]
    [INFO][2022-04-19 18:43:40.768][engine_graph_partitioner.cc:103] total partition(s) of graph[torch-jit-export]: 1.
    [INFO][2022-04-19 18:43:40.768][opt_graph.cc:329] added 4 new bridge kernels
    [INFO][2022-04-19 18:43:40.770][algo_conv_hmma.cc:129] Compiling Conv_0
    [INFO][2022-04-19 18:43:41.454][opt_graph.cc:583] deleted 2 bridge kernels
    INFO: ----- input info -----
    INFO: input[0]
    INFO:     name: input
    INFO:     dim(s): [1, 3, 4, 4]
    INFO:     type: FLOAT32
    INFO:     format: NDARRAY
    INFO:     byte(s) excluding padding: 192
    INFO:     in_data: [[[[-0.7580919  -1.0537796  -1.4523766  -1.1736736 ]
       [-0.50453496 -1.48383    -1.3174736  -0.8811438 ]
       [-1.5446684  -0.33240414 -1.429975   -1.172169  ]
       [-1.2639251  -0.00716734 -0.26453447 -1.4403057 ]]
    
      [[-1.6206262  -1.3826382  -0.74133873 -0.9391637 ]
       [-0.42861128 -0.09090185 -1.2538221  -0.02137303]
       [-0.074507   -0.29974604 -0.45086026 -1.9801757 ]
       [-0.07279325 -0.67775655 -1.4832225  -1.862076  ]]
    
      [[-1.0764339  -0.25367737 -1.8603811  -1.5876365 ]
       [-1.8216178  -0.6460962  -0.5559113  -0.9660294 ]
       [-1.837322   -1.0467303  -0.04060197 -0.5114651 ]
       [-0.21527338 -0.26388478 -1.6131785  -1.4633346 ]]]]
    INFO: ----- output info -----
    INFO: output[0]
    INFO:     name: 5
    INFO:     dim(s): [1, 3, 5, 5]
    INFO:     type: FLOAT32
    INFO:     format: NDARRAY
    INFO:     byte(s) excluding padding: 300
    INFO:     out_data: [[[[-0.7580919  -1.0537796  -1.4523766  -1.1736736  -0.50453496]
       [-1.48383    -1.3174736  -0.8811438  -1.5446684  -0.33240414]
       [-1.429975   -1.172169   -1.2639251  -0.00716734 -0.26453447]
       [-1.4403057  -1.6206262  -1.3826382  -0.74133873 -0.9391637 ]
       [-0.42861128 -0.09090185 -1.2538221  -0.02137303 -0.074507  ]]
    
      [[-0.29974604 -0.45086026 -1.9801757  -0.07279325 -0.67775655]
       [-1.4832225  -1.862076   -1.0764339  -0.25367737 -1.8603811 ]
       [-1.5876365  -1.8216178  -0.6460962  -0.5559113  -0.9660294 ]
       [-1.837322   -1.0467303  -0.04060197 -0.5114651  -0.21527338]
       [-0.26388478 -1.6131785  -1.4633346   0.          0.        ]]
    
      [[ 0.          0.          0.          0.          0.        ]
       [ 0.          0.          0.          0.          0.        ]
       [ 0.          0.          0.          0.          0.        ]
       [ 0.          0.          0.          0.          0.        ]
       [ 0.          0.          0.          0.          0.        ]]]]
    INFO: Run ok
    

    Which version(commit id or tag) of ppl.nn is used?

    PPLNN version: [0.6.3], commit: [9444a9d2ee0b89d8cd4a2fee8cef839fedfe8837]

    What's the operating system ppl.nn runs on?

    ubuntu18.04

    What's the compiler and its version?

    g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

    What are the commands used to build ppl.nn?

    ./build.sh -DHPCC_USE_X86_64=ON -DPPLNN_ENABLE_PYTHON_API=ON -DHPCC_USE_CUDA=ON

    What are the execution commands?

    PYTHONPATH=./pplnn-build/install/lib python3 ./tools/pplnn.py --use-cuda --onnx-model tests/testdata/conv.onnx


    opened by sky-fun 8
  • CUDA inference error

    What are the problems? (screenshots or detailed error messages)

    I changed the C++ classification sample project to use CUDA inference (the x86 version compiles and runs fine, and both the CUDA and x86 benchmarks run), but compilation prints the following:

    $ bear make -j
    Consolidate compiler generated dependencies of target classification
    [ 50%] Building CXX object CMakeFiles/classification.dir/classification.cpp.o
    [100%] Linking CXX executable classification
    /home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(engine_factory.cc.o): in function 'ppl::nn::CudaEngineFactory::Create(ppl::nn::CudaEngineOptions const&)':
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.h:42: undefined reference to 'cuModuleUnload'
    /home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(engine.cc.o): in function 'ppl::nn::cuda::CudaEngine::~CudaEngine()':
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.h:42: undefined reference to 'cuModuleUnload'
    /home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(engine.cc.o): in function 'ppl::nn::cuda::CudaEngine::~CudaEngine()':
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.h:42: undefined reference to 'cuModuleUnload'
    /home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(cuda_compiler.cc.o): in function 'ppl::nn::cuda::CUDANVRTCCompile(std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::vector<char const*, std::allocator<char const*> >, int, bool)':
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:44: undefined reference to 'nvrtcCreateProgram'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:45: undefined reference to 'nvrtcCompileProgram'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:48: undefined reference to 'nvrtcGetProgramLogSize'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:51: undefined reference to 'nvrtcGetProgramLog'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:56: undefined reference to 'nvrtcGetPTXSize'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:59: undefined reference to 'nvrtcGetPTX'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:60: undefined reference to 'nvrtcDestroyProgram'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:61: undefined reference to 'cudaDeviceSynchronize'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:60: undefined reference to 'nvrtcGetErrorString'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:44: undefined reference to 'nvrtcGetErrorString'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:59: undefined reference to 'nvrtcGetErrorString'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_compiler.cc:56: undefined reference to 'nvrtcGetErrorString'
    /home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(cuda_module.cc.o): in function 'ppl::nn::cuda::CUDAModule::GetKernelFunc()':
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.cc:25: undefined reference to 'cuModuleLoadDataEx'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.cc:25: undefined reference to 'cuGetErrorName'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.cc:28: undefined reference to 'cuModuleGetFunction'
    /home/ubuntu/Documents/ppl.nn/src/ppl/nn/engines/cuda/module/cuda_module.cc:28: undefined reference to 'cuGetErrorName'
    /home/ubuntu/Documents/ppl.nn/pplnn-build/install/lib/cmake/ppl/../../../lib/libpplnn_static.a(cuda_module.cc.o): in function 'ppl::nn::cuda::CUDAModule::GetKernelFunc(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)':
    ...
    ...
    ...
    

    Which version(commit id or tag) of ppl.nn is used?

    ppl.nn version: 0a545145b6b1816fd190c6023a588328872fe80f

    What's the operating system ppl.nn runs on?

    Linux ubuntu-1660ti 5.4.0-100-generic #113~18.04.1-Ubuntu SMP Mon Feb 7 15:02:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

    What's the compiler and its version?

I tried two versions of gcc; neither works:

    • gcc (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
    • gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

    What are the commands used to build ppl.nn?

    ./build.sh -DPPLNN_ENABLE_PYTHON_API=ON -DHPCC_USE_X86_64=ON -DHPCC_USE_CUDA=ON

    What are the execution commands?

    bear make -j

minimal code snippets for reproducing these problems (if necessary)

    #include "ppl/nn/engines/cuda/cuda_engine_options.h"
    #include "ppl/nn/engines/cuda/engine_factory.h"
    ...
    /************************ 2. create runtime builder from onnx model *************************/
        CudaEngineOptions options;
        options.device_id = 0;
        options.mm_policy = CUDA_MM_BEST_FIT;
    
        auto cuda_engine = CudaEngineFactory::Create(options);
        if (!cuda_engine)
        {
            return false;
        }
        cuda_engine->Configure(ppl::nn::CUDA_CONF_USE_DEFAULT_ALGORITHMS, false);
        vector<unique_ptr<Engine>> engines;
        vector<Engine *> engine_ptrs;
        engines.emplace_back(unique_ptr<Engine>(cuda_engine));
        engine_ptrs.emplace_back(engines[0].get());
    ...
    

models and inputs for reproducing these problems (send them to [email protected] if necessary)

    opened by watersounds 8
  • centernet runs with memory error.

My GPU is a Tesla T4, and the sample model runs normally. When I run CenterNet with --mm-policy=mem, it reports an error like this but still produces an output: [screenshot] When I use --mm-policy=perf, it fails with an out-of-memory error: [screenshot] Both runs end with a memory error. Is this error familiar to your team, or how can I avoid it?

    opened by Maosquerade 8
  • Celeron J1900 running the Python demo reports "get unsupported isa 0"

    [screenshot: Screenshot from 2022-03-24 10-39-10]

    CPU info for the J1900:

    vendor_id : GenuineIntel
    cpu family : 6
    model : 55
    model name : Intel(R) Celeron(R) CPU J1900 @ 1.99GHz
    stepping : 9
    microcode : 0x90c
    cpu MHz : 2042.652
    cache size : 1024 KB
    physical id : 0
    siblings : 4
    core id : 0
    cpu cores : 4
    apicid : 0
    initial apicid : 0
    fpu : yes
    fpu_exception : yes
    cpuid level : 11
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat md_clear
    bugs : cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
    bogomips : 4000.00
    clflush size : 64
    cache_alignment : 64
    address sizes : 36 bits physical, 48 bits virtual

gcc 7.5.0, OS: Ubuntu 18.04 LTS, PPLNN version: [0.6.3]

    I use the command: PYTHONPATH=./pplnn-build/install/lib python3 ./tools/pplnn.py --use-x86 --onnx-model tests/testdata/conv.onnx

Could it be that this CPU is too old to be supported?

    opened by F0xZz 7
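The "unsupported isa" error is consistent with the flags above: the J1900 tops out at SSE4.2 and reports no AVX. A quick way to check what a Linux CPU exposes before building is to read /proc/cpuinfo directly (this helper is an illustrative sketch, not part of ppl.nn; it returns an empty set on non-Linux systems):

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU feature flags reported by the kernel (Linux only)."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

flags = cpu_flags()
# The J1900 above reports sse4_2 but neither avx nor avx512f.
for isa in ("sse4_2", "avx", "avx2", "avx512f"):
    print(f"{isa}: {'yes' if isa in flags else 'no'}")
```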
  • [x86-compile] error: impossible constraint in ‘asm’

    I tried to compile the latest master.

    CPU | result
    ------- | -------------
    Core i5-9500 (no AVX-512 support) | error: impossible constraint in ‘asm’
    Xeon 6130 (AVX-512 support) | pass

    I see that the latest commit supports AVX-512. If this is a bug, will ppl support more CPUs (without AVX-512), and is there a macro to separate out the AVX-512 code? Thanks.

    opened by alanzhai219 7
  • Why can't the sample model tests/testdata/conv.onnx be profiled?

    I built the ppl.nn project and ran one test with ./pplnn-build/tools/pplnn --onnx-model tests/testdata/conv.onnx; it works normally:

    [INFO][2021-07-05 21:47:39.764][pplnn.cc:683] ppl.nn version: 7dd75a1077867fc9a762449953417088446ae2f8-dirty
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:110] ***** register X86Engine *****
    [INFO][2021-07-05 21:47:39.764][simple_graph_partitioner.cc:90] total partition(s) of graph[torch-jit-export]: 1.
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:523] ----- input info -----
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:526] input[0]:
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:527]     name: input
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:534]     dim(s): 1 3 4 4
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:536]     DataType: FLOAT32
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:537]     DataFormat: NDARRAY
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:538]     NumBytesIncludePadding: 192
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:539]     NumBytesExcludePadding: 192
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:542] ----- output info -----
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:545] output[0]:
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:546]     name: 5
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:553]     dim(s): 1 3 5 5
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:555]     DataType: FLOAT32
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:556]     DataFormat: N16CX
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:557]     NumBytesIncludePadding: 1600
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:558]     NumBytesExcludePadding: 300
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:561] ----------------------
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:791] Run() costs: 0.010000 ms.
    [INFO][2021-07-05 21:47:39.764][pplnn.cc:799] Run ok
    

When I try to run in profiling mode, it gets stuck somewhere and never returns. 😅😅 The code is too new to me, and it's hard to find the cause. Please help!

command: ./pplnn-build/tools/pplnn --onnx-model tests/testdata/conv.onnx --enable-profiling --warmuptimes 3

    [INFO][2021-07-05 21:52:35.459][pplnn.cc:683] ppl.nn version: 7dd75a1077867fc9a762449953417088446ae2f8-dirty
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:110] ***** register X86Engine *****
    [INFO][2021-07-05 21:52:35.459][simple_graph_partitioner.cc:90] total partition(s) of graph[torch-jit-export]: 1.
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:523] ----- input info -----
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:526] input[0]:
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:527]     name: input
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:534]     dim(s): 1 3 4 4
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:536]     DataType: FLOAT32
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:537]     DataFormat: NDARRAY
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:538]     NumBytesIncludePadding: 192
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:539]     NumBytesExcludePadding: 192
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:542] ----- output info -----
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:545] output[0]:
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:546]     name: 5
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:553]     dim(s): 1 3 5 5
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:555]     DataType: FLOAT32
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:556]     DataFormat: N16CX
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:557]     NumBytesIncludePadding: 1600
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:558]     NumBytesExcludePadding: 300
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:561] ----------------------
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:791] Run() costs: 0.010000 ms.
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:799] Run ok
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:803] Warm up start for 3 times.
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:810] Warm up end.
    [INFO][2021-07-05 21:52:35.459][pplnn.cc:818] Profiling start
    
    opened by codgeek 7
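The byte counts in the output info above can be reproduced by hand, assuming N16CX pads the channel dimension up to a multiple of 16 (which is consistent with the logged numbers): a 1x3x5x5 fp32 tensor occupies 1x16x5x5x4 = 1600 bytes including padding, but only 1x3x5x5x4 = 300 bytes of payload. A small sketch of that arithmetic (the helper name is mine, not a ppl.nn API):

```python
def n16cx_bytes(n, c, h, w, elem_size=4, block=16):
    """Bytes for an NCHW fp32 tensor stored with channels padded to a block of 16."""
    padded_c = ((c + block - 1) // block) * block  # round channels up to 16
    including = n * padded_c * h * w * elem_size
    excluding = n * c * h * w * elem_size
    return including, excluding

# Matches the pplnn log: NumBytesIncludePadding 1600, NumBytesExcludePadding 300.
print(n16cx_bytes(1, 3, 5, 5))  # (1600, 300)
```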
  • WSL compile error: failed to convert GOTPCREL relocation; relink with --no-relax

    I want to compile the classification sample under the path ppl.nn/pplnn-build/samples/cpp/run_model, but get this issue. The compile log is:

    [  1%] Built target pplcommon_static
    Consolidate compiler generated dependencies of target PPLCUDAKernel
    [ 10%] Built target PPLCUDAKernel
    [ 18%] Built target libprotobuf
    [ 43%] Built target PPLKernelX86
    Consolidate compiler generated dependencies of target pplnn_static
    [100%] Built target pplnn_static
    Consolidate compiler generated dependencies of target classification
    [100%] Linking CXX executable classification
    /usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
    collect2: error: ld returned 1 exit status
    samples/cpp/run_model/CMakeFiles/classification.dir/build.make:150: recipe for target 'samples/cpp/run_model/classification' failed
    make[2]: *** [samples/cpp/run_model/classification] Error 1
    CMakeFiles/Makefile2:1040: recipe for target 'samples/cpp/run_model/CMakeFiles/classification.dir/all' failed
    make[1]: *** [samples/cpp/run_model/CMakeFiles/classification.dir/all] Error 2
    Makefile:135: recipe for target 'all' failed
    make: *** [all] Error 2

    My CUDA version is 11.0.

    opened by szh6 6
  • PPL.NN slower than onnxruntime

I can't get the expected performance on my machine (Intel® Core™ i7-10700 CPU @ 2.90GHz × 16):

    [screenshot]

    The model is simply YOLOv5s.

    Why? I have already exported NUM_THREADS=8, yet it cannot even beat single-threaded onnxruntime.

    opened by jinfagang 0
  • Ubuntu Python build failed

    ppl.nn/deps/pybind11/include/pybind11/detail/../detail/common.h:213:10: fatal error: Python.h: No such file or directory
      213 | #include <Python.h>
          |          ^~~~~~~~~~
    
    

    env:

    locate Python.h                                                                                2 ↵
    /usr/include/python3.10/Python.h
    
    opened by jinfagang 1
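A missing Python.h usually means the CPython development headers aren't visible to the build (on Debian/Ubuntu they typically come from the python3-dev package). The include directory the build needs can be queried from the interpreter itself, which helps confirm whether the headers are installed for the Python being targeted:

```python
import os
import sysconfig

# Directory that should contain Python.h for the current interpreter.
include_dir = sysconfig.get_paths()["include"]
print(include_dir)
print("Python.h present:", os.path.exists(os.path.join(include_dir, "Python.h")))
```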
  • x86 build failed

    ppl.nn/deps/pplcommon/python/py_types.cc:19:
    /usr/include/python3.10/pyconfig.h:3:12: fatal error: x86_64-linux-gnu/python3.10/pyconfig.h: No such file or directory
        3 | #  include <x86_64-linux-gnu/python3.10/pyconfig.h>
          |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    
    opened by jinfagang 0
  • ppl inference question: per-run inference time varies with the number of runtime.Run() calls; for ResNet-50, 10 runs give a per-run ppl_speed of 4.7, while 1000 runs give 23.4

    opened by geniusjunjun 0
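Per-run numbers that change with the iteration count often point at measurement methodology (warm-up, averaging, clock resolution) or thermal/frequency-scaling effects rather than the engine itself. A generic timing harness that separates warm-up from the measured loop might look like this; the workload here is a placeholder, not runtime.Run():

```python
import time

def benchmark(fn, warmup=10, repeats=100):
    """Return average per-call latency in milliseconds after a warm-up phase."""
    for _ in range(warmup):      # let caches and CPU clocks settle first
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats * 1e3

# Placeholder workload standing in for runtime.Run().
work = lambda: sum(i * i for i in range(10_000))
print(f"avg latency: {benchmark(work):.3f} ms")
```

Comparing the averages from several different `repeats` values on the same machine helps separate a real slowdown from measurement noise.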
  • Does x86 support fp16 inference? If so, must the input data be converted to fp16, or can fp32 inputs be used with automatic conversion to fp16 during inference? Does pplnn currently support only the T4 GPU, or other GPUs as well?

    opened by royywang 0
  • Which result metric does ppl.nn use to measure the performance of the CPU under test?

    Which result metric does ppl.nn use to measure the performance of the CPU under test?

    I want to use ppl.nn as a benchmark to measure server CPU performance, and have a few questions:

    1. Which result metric does ppl.nn use to measure the performance of the CPU under test, and how should the results be interpreted?
    2. Can build.sh be compiled with -DPPLNN_USE_ARMV9=ON?
    3. Is there a way to control the input dim(s): 1 3 630 160? I'd like to set these four values to control the amount of computation.
    opened by CodeDance364 2
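On question 1 above: for CPU benchmarking, the usual figures of merit are per-run latency and throughput, both derived from the measured run times. A hedged sketch of the conversion (the function name and structure are illustrative, not part of ppl.nn's output):

    ```python
    def summarize(latencies_s, batch_size=1):
        # Convert a list of measured per-run latencies (in seconds) into the two
        # figures commonly used to compare CPUs: average latency and throughput.
        n = len(latencies_s)
        total = sum(latencies_s)
        avg_latency_ms = total / n * 1000.0
        throughput_qps = n * batch_size / total  # inferences per second
        return avg_latency_ms, throughput_qps

    # Example: 4 runs at 25 ms each -> ~25 ms latency, ~40 QPS.
    lat_ms, qps = summarize([0.025, 0.025, 0.025, 0.025])
    print(lat_ms, qps)
    ```

Lower latency and higher QPS on the same model and input shape indicate a faster CPU; the input shape (e.g. 1 3 630 160) fixes the amount of computation per run, so it must be held constant across machines for the comparison to be fair.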
A library for high performance deep learning inference on NVIDIA GPUs.

Forward - A library for high performance deep learning inference on NVIDIA GPUs Forward - A library for high performance deep learning inference on NV

Tencent 502 Jul 31, 2022
TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.

TensorRT Open Source Software This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. Included are the sources for Tens

NVIDIA Corporation 5.7k Jul 29, 2022
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

ONNX Runtime is a cross-platform inference and training machine-learning accelerator compatible with deep learning frameworks, PyTorch and TensorFlow/Keras, as well as classical machine learning libraries such as scikit-learn, and more.

Microsoft 7.3k Aug 6, 2022
Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon developed library for building Deep Learning (DL) machine learning (ML) models

Amazon DSSTNE: Deep Scalable Sparse Tensor Network Engine DSSTNE (pronounced "Destiny") is an open source software library for training and deploying

Amazon Archives 4.4k Jul 30, 2022
Examples for using ONNX Runtime for machine learning inferencing.

Examples for using ONNX Runtime for machine learning inferencing.

Microsoft 267 Jul 27, 2022
Edge ML Library - High-performance Compute Library for On-device Machine Learning Inference

Edge ML Library (EMLL) offers optimized basic routines like general matrix multiplications (GEMM) and quantizations, to speed up machine learning (ML) inference on ARM-based devices. EMLL supports fp32, fp16 and int8 data types. EMLL accelerates on-device NMT, ASR and OCR engines of Youdao, Inc.

NetEase Youdao 176 Jul 21, 2022
TFCC is a C++ deep learning inference framework.

TFCC is a C++ deep learning inference framework.

Tencent 110 Jul 16, 2022
KSAI Lite is a deep learning inference framework of kingsoft, based on tensorflow lite

KSAI Lite English | 简体中文 KSAI Lite is a lightweight, flexible, high-performance, and easily extensible deep learning inference framework built on TensorFlow Lite, targeting multiple hardware platforms including mobile, embedded, and server. KSAI Lite is already used in Kingsoft Office's internal business and is gradually supporting Kingsoft

null 75 Apr 14, 2022
Helper Class for Deep Learning Inference Frameworks: TensorFlow Lite, TensorRT, OpenCV, ncnn, MNN, SNPE, Arm NN, NNAbla

InferenceHelper This is a helper class for deep learning frameworks especially for inference This class provides an interface to use various deep lear

iwatake 154 Aug 1, 2022
MACE is a deep learning inference framework optimized for mobile heterogeneous computing platforms.

Mobile AI Compute Engine (or MACE for short) is a deep learning inference framework optimized for mobile heterogeneous computing on Android, iOS, Linux and Windows devices.

Xiaomi 4.7k Jul 28, 2022
Benchmark framework of compute-in-memory based accelerators for deep neural network (inference engine focused)

DNN+NeuroSim V1.3 The DNN+NeuroSim framework was developed by Prof. Shimeng Yu's group (Georgia Institute of Technology). The model is made publicly a

NeuroSim 24 Jul 19, 2022
Triton - a language and compiler for writing highly efficient custom Deep-Learning primitives.

Triton - a language and compiler for writing highly efficient custom Deep-Learning primitives.

OpenAI 3.8k Aug 1, 2022
An Out-of-the-Box TensorRT-based Framework for High Performance Inference with C++/Python Support

An Out-of-the-Box TensorRT-based Framework for High Performance Inference with C++/Python Support

手写AI 1.1k Jul 29, 2022
ncnn is a high-performance neural network inference framework optimized for the mobile platform

ncnn ncnn is a high-performance neural network inference computing framework optimized for mobile platforms. ncnn is deeply considerate about deployme

Tencent 15.1k Aug 2, 2022
A library for creating Artificial Neural Networks, for use in Machine Learning and Deep Learning algorithms.

iNeural A library for creating Artificial Neural Networks, for use in Machine Learning and Deep Learning algorithms. What is a Neural Network? Work on

Fatih Küçükkarakurt 5 Apr 5, 2022
The dgSPARSE Library (Deep Graph Sparse Library) is a high performance library for sparse kernel acceleration on GPUs based on CUDA.

dgSPARSE Library Introdution The dgSPARSE Library (Deep Graph Sparse Library) is a high performance library for sparse kernel acceleration on GPUs bas

dgSPARSE 52 Jul 28, 2022
Simple inference deep head pose ncnn version

ncnn-deep-head-pose Simple implement inference deep head pose ncnn version with high performance and optimized resource. This project based on deep-he

Đỗ Công Minh 11 Jun 13, 2022
Nimble: Physics Engine for Deep Learning

Nimble: Physics Engine for Deep Learning

Keenon Werling 271 Aug 2, 2022