Triton Python and C++ client libraries and examples, and client examples for Go, Java and Scala.

Triton Client Libraries and Examples

To simplify communication with Triton, the Triton project provides several client libraries and examples of how to use those libraries. Ask questions or report problems in the main Triton issues page.

The provided client libraries are:

  • C++ and Python APIs that make it easy to communicate with Triton from your C++ or Python application. Using these libraries you can send either HTTP/REST or GRPC requests to Triton to access all its capabilities: inferencing, status and health, statistics and metrics, model repository management, etc. These libraries also support using system and CUDA shared memory for passing inputs to and receiving outputs from Triton.

  • The protoc compiler can generate a GRPC API in a large number of programming languages. See src/go for an example for the Go programming language. See src/java for an example for the Java and Scala programming languages.

There are also many example applications that show how to use these libraries. Many of these examples use models from the example model repository.

  • C++ and Python versions of image_client, an example application that uses the C++ or Python client library to execute image classification models on Triton. See Image Classification Example.

  • Several simple C++ examples show how to use the C++ library to communicate with Triton to perform inferencing and other tasks. The C++ examples demonstrating the HTTP/REST client are named with a simple_http_ prefix and the examples demonstrating the GRPC client are named with a simple_grpc_ prefix. See Simple Example Applications.

  • Several simple Python examples show how to use the Python library to communicate with Triton to perform inferencing and other tasks. The Python examples demonstrating the HTTP/REST client are named with a simple_http_ prefix and the examples demonstrating the GRPC client are named with a simple_grpc_ prefix. See Simple Example Applications.

  • A couple of Python examples that communicate with Triton using a Python GRPC API generated by the protoc compiler. grpc_client.py is a simple example that shows basic API usage. grpc_image_client.py is functionally equivalent to image_client but uses a generated GRPC client stub to communicate with Triton.

Getting the Client Libraries And Examples

The easiest way to get the Python client library is to use pip to install the tritonclient module. You can also download both the C++ and Python client libraries from the Triton GitHub release page, or download a pre-built Docker image containing the client libraries from NVIDIA GPU Cloud (NGC).

It is also possible to build the client libraries with cmake.

Download Using Python Package Installer (pip)

The GRPC and HTTP client libraries are available as a Python package that can be installed using a recent version of pip. Currently pip install is only available on Linux.

$ pip install nvidia-pyindex
$ pip install tritonclient[all]

Using all installs both the HTTP/REST and GRPC client libraries. Two optional packages, grpc and http, are also available if you only need support for a specific protocol. For example, to install only the HTTP/REST client library use:

$ pip install nvidia-pyindex
$ pip install tritonclient[http]
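
As a quick sanity check after installation, the following minimal Python sketch (assuming a Triton server is already running locally on the default HTTP port 8000) creates a client and verifies that the server is live:

import tritonclient.http as httpclient

# Connect to a locally running Triton server (default HTTP port is 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Basic liveness and readiness checks.
print("Server live: ", client.is_server_live())
print("Server ready:", client.is_server_ready())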

The components of the install packages are:

  • http
  • grpc [ service_pb2, service_pb2_grpc, model_config_pb2 ]
  • utils [ Linux distributions will include shared_memory and cuda_shared_memory ]

The Linux version of the package also includes the perf_analyzer binary. The perf_analyzer binary is built on Ubuntu 20.04 and may not run on other Linux distributions. To run the perf_analyzer the following dependency must be installed:

sudo apt update
sudo apt install libb64-dev

Download From GitHub

The client libraries and the perf_analyzer executable can be downloaded from the Triton GitHub release page corresponding to the release you are interested in. The client libraries are found in the "Assets" section of the release page in a tar file named after the version of the release and the OS, for example, v2.3.0_ubuntu2004.clients.tar.gz.

The pre-built libraries can be used on the corresponding host system or you can install them into the Triton container to have both the clients and server in the same container.

$ mkdir clients
$ cd clients
$ wget https://github.com/triton-inference-server/server/releases/download/<tarfile_path>
$ tar xzf <tarfile_name>

After installing, the libraries can be found in lib/, the headers in include/, and the Python wheel files in python/. The bin/ and python/ directories contain the built examples that you can learn more about below.

The perf_analyzer binary is built on Ubuntu 20.04 and may not run on other Linux distributions. To use the C++ libraries or perf_analyzer executable you must install some dependencies.

$ apt-get update
$ apt-get install curl libcurl4-openssl-dev libb64-dev

Download Docker Image From NGC

A Docker image containing the client libraries and examples is available from NVIDIA GPU Cloud (NGC). Before attempting to pull the container ensure you have access to NGC. For step-by-step instructions, see the NGC Getting Started Guide.

Use docker pull to get the client libraries and examples container from NGC.

$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk

Where <xx.yy> is the version that you want to pull. Within the container the client libraries are in /workspace/install/lib, the corresponding headers in /workspace/install/include, and the Python wheel files in /workspace/install/python. The image will also contain the built client examples.

Build Using CMake

The client library build is performed using CMake. To build the client libraries and examples with all features, first change directory to the root of this repo and check out the release version of the branch that you want to build (or the main branch if you want to build the under-development version).

$ git checkout main

Building on Windows vs. non-Windows requires different invocations because Triton on Windows does not yet support all the build options.

Non-Windows

Use cmake to configure the build.

$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX=`pwd`/install -DTRITON_ENABLE_CC_HTTP=ON -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_PERF_ANALYZER=ON -DTRITON_ENABLE_PYTHON_HTTP=ON -DTRITON_ENABLE_PYTHON_GRPC=ON -DTRITON_ENABLE_GPU=ON -DTRITON_ENABLE_EXAMPLES=ON -DTRITON_ENABLE_TESTS=ON ..

Then use make to build the clients and examples.

$ make cc-clients python-clients

When the build completes the libraries and examples can be found in the install directory.

Windows

To build the clients you must install an appropriate C++ compiler and other dependencies required for the build. The easiest way to do this is to create the Windows min Docker image and then perform the build within a container launched from that image.

> docker run -it --rm win10-py3-min powershell

It is not necessary to use Docker or the win10-py3-min container for the build, but if you do not you must install the appropriate dependencies onto your host system.

Next use cmake to configure the build. If you are not building within the win10-py3-min container then you will likely need to adjust the CMAKE_TOOLCHAIN_FILE location in the following command.

$ mkdir build
$ cd build
$ cmake -DVCPKG_TARGET_TRIPLET=x64-windows -DCMAKE_TOOLCHAIN_FILE='/vcpkg/scripts/buildsystems/vcpkg.cmake' -DCMAKE_INSTALL_PREFIX=install -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_PYTHON_GRPC=ON -DTRITON_ENABLE_GPU=OFF -DTRITON_ENABLE_EXAMPLES=ON -DTRITON_ENABLE_TESTS=ON ..

Then use msbuild.exe to build.

$ msbuild.exe cc-clients.vcxproj -p:Configuration=Release -clp:ErrorsOnly
$ msbuild.exe python-clients.vcxproj -p:Configuration=Release -clp:ErrorsOnly

When the build completes the libraries and examples can be found in the install directory.

Client Library APIs

The C++ client API exposes a class-based interface. The commented interface is available in grpc_client.h, http_client.h, and common.h.

The Python client API provides capabilities similar to the C++ API. The commented interface is available in grpc and http.
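
As an illustration of the Python API, the following minimal sketch performs a single inference over HTTP/REST. It assumes a Triton server on localhost:8000 serving the simple model from the example model repository (two INT32 inputs of shape [1, 16]); adjust the model name, input names, shapes, and datatypes for your own model.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe the inputs and attach data from numpy arrays.
data = np.arange(16, dtype=np.int32).reshape(1, 16)
inputs = [
    httpclient.InferInput("INPUT0", [1, 16], "INT32"),
    httpclient.InferInput("INPUT1", [1, 16], "INT32"),
]
inputs[0].set_data_from_numpy(data)
inputs[1].set_data_from_numpy(data)

# Request specific outputs and run the inference.
outputs = [
    httpclient.InferRequestedOutput("OUTPUT0"),
    httpclient.InferRequestedOutput("OUTPUT1"),
]
result = client.infer(model_name="simple", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT0"))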

GRPC Options

SSL/TLS

The client library allows communication across a secured channel using the gRPC protocol.

For the C++ client, see the SslOptions struct that encapsulates these options in grpc_client.h.

For the Python client, look for the following options in grpc/__init__.py:

  • ssl
  • root_certificates
  • private_key
  • certificate_chain

The C++ and Python examples demonstrate how to use SSL/TLS settings on the client side. For information on the corresponding server-side parameters, refer to the server documentation.
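
For example, a minimal Python sketch of creating a secured gRPC client might look like the following; the certificate and key paths are placeholders, and the server must be started with matching SSL/TLS options:

import tritonclient.grpc as grpcclient

# The paths below are placeholders for your own certificates and keys.
client = grpcclient.InferenceServerClient(
    url="localhost:8001",
    ssl=True,
    root_certificates="ca.crt",
    private_key="client.key",
    certificate_chain="client.crt",
)
print(client.is_server_live())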

Compression

The client library also exposes options to use on-wire compression for gRPC transactions.

For the C++ client, see the compression_algorithm parameter in the Infer, AsyncInfer and StartStream functions in grpc_client.h. By default, the parameter is set to GRPC_COMPRESS_NONE.

Similarly, for the Python client, see the compression_algorithm parameter in the infer, async_infer and start_stream functions in grpc/__init__.py.

The C++ and Python examples demonstrate how to configure compression for clients. For information on the corresponding server-side parameters, refer to the server documentation.
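
As a minimal sketch, again assuming the simple example model and a local gRPC endpoint, compression can be requested per call like this:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

data = np.arange(16, dtype=np.int32).reshape(1, 16)
inputs = [grpcclient.InferInput("INPUT0", [1, 16], "INT32"),
          grpcclient.InferInput("INPUT1", [1, 16], "INT32")]
inputs[0].set_data_from_numpy(data)
inputs[1].set_data_from_numpy(data)

# Request gzip compression for this call; "deflate" is also supported.
result = client.infer(model_name="simple", inputs=inputs,
                      compression_algorithm="gzip")
print(result.as_numpy("OUTPUT0"))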

GRPC KeepAlive

Triton exposes GRPC KeepAlive parameters with the default values for both client and server described here.

You can find a KeepAliveOptions struct/class that encapsulates these parameters in both the C++ and Python client libraries.

There is also a C++ and Python example demonstrating how to set up these parameters on the client side. For information on the corresponding server-side parameters, refer to the server documentation.
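
For the Python gRPC client, a minimal sketch of overriding the KeepAlive defaults looks roughly like the following; the values shown are purely illustrative, not recommendations:

import tritonclient.grpc as grpcclient

# Illustrative values only; tune them for your deployment.
keepalive_options = grpcclient.KeepAliveOptions(
    keepalive_time_ms=10000,
    keepalive_timeout_ms=5000,
    keepalive_permit_without_calls=True,
    http2_max_pings_without_data=2,
)
client = grpcclient.InferenceServerClient(
    url="localhost:8001", keepalive_options=keepalive_options)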

Simple Example Applications

This section describes several of the simple example applications and the features that they illustrate.

Bytes/String Datatype

Some frameworks support tensors where each element in the tensor is variable-length binary data. Each element can hold a string or an arbitrary sequence of bytes. On the client this datatype is BYTES (see Datatypes for information on supported datatypes).

The Python client library uses numpy to represent input and output tensors. For BYTES tensors the dtype of the numpy array should be 'np.object_' as shown in the examples. For backwards compatibility with previous versions of the client library, 'np.bytes_' can also be used for BYTES tensors. However, using 'np.bytes_' is not recommended because using this dtype will cause numpy to remove all trailing zeros from each array element. As a result, binary sequences ending in zero(s) will not be represented correctly.
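
For example, here is a minimal Python sketch of building a BYTES input; the model name my_string_model and its input INPUT0 of shape [1, 2] are hypothetical:

import numpy as np
import tritonclient.http as httpclient

# Use np.object_ so trailing zero bytes are preserved.
data = np.array([[b"hello", b"bytes\x00with\x00zeros\x00"]], dtype=np.object_)

inp = httpclient.InferInput("INPUT0", [1, 2], "BYTES")
inp.set_data_from_numpy(data)

client = httpclient.InferenceServerClient(url="localhost:8000")
result = client.infer(model_name="my_string_model", inputs=[inp])  # hypothetical model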

BYTES tensors are demonstrated in the C++ example applications simple_http_string_infer_client.cc and simple_grpc_string_infer_client.cc. String tensors are demonstrated in the Python example applications simple_http_string_infer_client.py and simple_grpc_string_infer_client.py.

System Shared Memory

Using system shared memory to communicate tensors between the client library and Triton can significantly improve performance in some cases.

Using system shared memory is demonstrated in the C++ example applications simple_http_shm_client.cc and simple_grpc_shm_client.cc. Using system shared memory is demonstrated in the Python example applications simple_http_shm_client.py and simple_grpc_shm_client.py.

Python does not have a standard way of allocating and accessing shared memory, so a simple system shared memory module is provided as an example; it can be used with the Python client library to create, set and destroy system shared memory.
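
The following minimal sketch shows the typical flow with the Python HTTP client and the tritonclient.utils.shared_memory module; the region names, sizes and input are illustrative:

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient(url="localhost:8000")

# Create a system shared memory region and copy the input data into it.
data = np.arange(16, dtype=np.int32).reshape(1, 16)
byte_size = data.size * data.itemsize
shm_handle = shm.create_shared_memory_region("input0_data", "/input0_shm", byte_size)
shm.set_shared_memory_region(shm_handle, [data])

# Register the region with Triton and reference it from the input.
client.register_system_shared_memory("input0_data", "/input0_shm", byte_size)
inp = httpclient.InferInput("INPUT0", [1, 16], "INT32")
inp.set_shared_memory("input0_data", byte_size)

# ... run client.infer(...) as usual, then clean up.
client.unregister_system_shared_memory("input0_data")
shm.destroy_shared_memory_region(shm_handle)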

CUDA Shared Memory

Using CUDA shared memory to communicate tensors between the client library and Triton can significantly improve performance in some cases.

Using CUDA shared memory is demonstrated in the C++ example applications simple_http_cudashm_client.cc and simple_grpc_cudashm_client.cc. Using CUDA shared memory is demonstrated in the Python example applications simple_http_cudashm_client.py and simple_grpc_cudashm_client.py.

Python does not have a standard way of allocating and accessing shared memory, so a simple CUDA shared memory module is provided as an example; it can be used with the Python client library to create, set and destroy CUDA shared memory.
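
The flow mirrors the system shared memory case, except that the region lives on a GPU and registration uses a raw CUDA handle. A minimal sketch with the tritonclient.utils.cuda_shared_memory module (names and sizes are again illustrative):

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.arange(16, dtype=np.int32).reshape(1, 16)
byte_size = data.size * data.itemsize

# Allocate a CUDA shared memory region on GPU 0 and copy the data into it.
cuda_handle = cudashm.create_shared_memory_region("input0_cuda", byte_size, 0)
cudashm.set_shared_memory_region(cuda_handle, [data])

# Registration requires the raw serialized handle and the device id.
client.register_cuda_shared_memory(
    "input0_cuda", cudashm.get_raw_handle(cuda_handle), 0, byte_size)

inp = httpclient.InferInput("INPUT0", [1, 16], "INT32")
inp.set_shared_memory("input0_cuda", byte_size)

# ... run client.infer(...) as usual, then clean up.
client.unregister_cuda_shared_memory("input0_cuda")
cudashm.destroy_shared_memory_region(cuda_handle)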

Client API for Stateful Models

When performing inference using a stateful model, a client must identify which inference requests belong to the same sequence and also when a sequence starts and ends.

Each sequence is identified with a sequence ID that is provided when an inference request is made. It is up to the clients to create a unique sequence ID. For each sequence the first inference request should be marked as the start of the sequence and the last inference request should be marked as the end of the sequence.

The use of sequence ID and start and end flags is demonstrated in the C++ example applications simple_http_sequence_stream_infer_client.cc and simple_grpc_sequence_stream_infer_client.cc. The use of sequence ID and start and end flags is demonstrated in the Python example applications simple_http_sequence_stream_infer_client.py and simple_grpc_sequence_stream_infer_client.py.
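
For example, with the Python HTTP client the flags are passed on each request. The sketch below assumes a hypothetical sequence model named my_sequence_model with a single INT32 input of shape [1, 1]:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
sequence_id = 1000  # chosen by the client, must be unique per active sequence

values = [0, 1, 2, 3]
for i, v in enumerate(values):
    inp = httpclient.InferInput("INPUT", [1, 1], "INT32")
    inp.set_data_from_numpy(np.array([[v]], dtype=np.int32))
    client.infer(
        model_name="my_sequence_model",          # hypothetical model
        inputs=[inp],
        sequence_id=sequence_id,
        sequence_start=(i == 0),                 # mark the first request
        sequence_end=(i == len(values) - 1))     # mark the last request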

Image Classification Example

The image classification example that uses the C++ client API is available at src/c++/examples/image_client.cc. The Python version of the image classification client is available at src/python/examples/image_client.py.

To use image_client (or image_client.py) you must first have a running Triton that is serving one or more image classification models. The image_client application requires that the model have a single image input and produce a single classification output. If you don't have a model repository with image classification models see QuickStart for instructions on how to create one.

Once Triton is running you can use the image_client application to send inference requests. You can specify a single image or a directory holding images. Here we send a request to the inception_graphdef model for an image from the qa/images directory.

$ image_client -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image 'qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG

The Python version of the application accepts the same command-line arguments.

$ python image_client.py -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image 'qa/images/mug.jpg':
     0.826384 (505) = COFFEE MUG

The image_client and image_client.py applications use the client libraries to talk to Triton. By default image_client instructs the client library to use HTTP/REST protocol, but you can use the GRPC protocol by providing the -i flag. You must also use the -u flag to point at the GRPC endpoint on Triton.

$ image_client -i grpc -u localhost:8001 -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image 'qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG

By default the client prints the most probable classification for the image. Use the -c flag to see more classifications.

$ image_client -m inception_graphdef -s INCEPTION -c 3 qa/images/mug.jpg
Request 0, batch size 1
Image 'qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG
    0.157077 (969) = CUP
    0.002880 (968) = ESPRESSO

The -b flag allows you to send a batch of images for inferencing. The image_client application will form the batch from the image or images that you specified. If the batch is bigger than the number of images then image_client will just repeat the images to fill the batch.

$ image_client -m inception_graphdef -s INCEPTION -c 3 -b 2 qa/images/mug.jpg
Request 0, batch size 2
Image 'qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG
    0.157077 (969) = CUP
    0.002880 (968) = ESPRESSO
Image 'qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG
    0.157077 (969) = CUP
    0.002880 (968) = ESPRESSO

Provide a directory instead of a single image to perform inferencing on all images in the directory.

$ image_client -m inception_graphdef -s INCEPTION -c 3 -b 2 qa/images
Request 0, batch size 2
Image '/opt/tritonserver/qa/images/car.jpg':
    0.819196 (818) = SPORTS CAR
    0.033457 (437) = BEACH WAGON
    0.031232 (480) = CAR WHEEL
Image '/opt/tritonserver/qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG
    0.157077 (969) = CUP
    0.002880 (968) = ESPRESSO
Request 1, batch size 2
Image '/opt/tritonserver/qa/images/vulture.jpeg':
    0.977632 (24) = VULTURE
    0.000613 (9) = HEN
    0.000560 (137) = EUROPEAN GALLINULE
Image '/opt/tritonserver/qa/images/car.jpg':
    0.819196 (818) = SPORTS CAR
    0.033457 (437) = BEACH WAGON
    0.031232 (480) = CAR WHEEL

The grpc_image_client.py application behaves the same as the image_client except that instead of using the client library it uses the GRPC generated library to communicate with Triton.

Ensemble Image Classification Example Application

In comparison to the image classification example above, this example uses an ensemble of an image-preprocessing model implemented as a DALI backend and a TensorFlow Inception model. The ensemble model allows you to send the raw image binaries in the request and receive classification results without preprocessing the images on the client.

To try this example you should follow the DALI ensemble example instructions.

Comments
  • feat: Fix generation, update go structure, add go module

    I tried out the Go gRPC client but found some issues; let me know if you want me to break it up further.

    This PR aims to make the following fixes:

    • Plugins are no longer supported, so update the protoc command: https://github.com/golang/protobuf/issues/1070
    • Add a protobuf option to allow for the Go package generation: https://developers.google.com/protocol-buffers/docs/reference/go-generated#package
    • Add a go.mod file and include the generated package so that everyone does not have to generate it.
    • Update README.md

    Thank you!

    opened by NikeNano 22
  • Soften grpcio requirement in python library

    Show a warning when the imported version has a memory leakage issue, but allow users to proceed at their own risk.

    As there are many packages depending on grpc, it is becoming exceedingly difficult to manage a pinned version. Also, depending on the use case, the memory leakage problem might be worked around if we regularly restart or otherwise manage servers and worker nodes, which is often the case.

    opened by chajath 21
  • Aggregate trial statistics

    Aggregate trial statistics to report the average of trials instead of reporting the last trial.

    After

    Ensemble Model

    PA Output

    *** Measurement Settings ***
      Batch size: 1
      Using "time_windows" mode for stabilization
      Measurement window: 5000 msec
      Using synchronous calls for inference
      Stabilizing using average latency
    
    Request concurrency: 1
      Pass [1] throughput: 269.4 infer/sec. Avg latency: 3705 usec (std 220 usec)
      Pass [2] throughput: 267.4 infer/sec. Avg latency: 3733 usec (std 239 usec)
      Pass [3] throughput: 268.6 infer/sec. Avg latency: 3714 usec (std 228 usec)
      Client:
        Request count: 4027
        Throughput: 268.467 infer/sec
        Avg latency: 3717 usec (standard deviation 229 usec)
        p50 latency: 3736 usec
        p90 latency: 3983 usec
        p95 latency: 4010 usec
        p99 latency: 4049 usec
        Avg HTTP time: 3699 usec (send 148 usec + response wait 3550 usec + receive 1 usec)
      Server:
        Inference count: 4809
        Execution count: 4809
        Successful request count: 4809
        Avg request latency: 3089 usec (overhead 297 usec + queue 188 usec + compute 2604 usec)
    
      Composing models:
      add_sub_1, version:
          Inference count: 4809
          Execution count: 4809
          Successful request count: 4809
          Avg request latency: 1505 usec (overhead 167 usec + queue 87 usec + compute input 165 usec + compute infer 797 usec + compute output 288 usec)
    
      add_sub_2, version:
          Inference count: 4809
          Execution count: 4809
          Successful request count: 4809
          Avg request latency: 1624 usec (overhead 170 usec + queue 101 usec + compute input 164 usec + compute infer 813 usec + compute output 375 usec)
    
    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 268.467 infer/sec, latency 3717 usec
    

    Verbose CSV

    Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
    1,268.467,148,956,6,330,1610,664,1,3736,3983,4010,4049,3717,149,3550
    

    Sequence Model

    PA Output

    *** Measurement Settings ***
      Batch size: 1
      Using "time_windows" mode for stabilization
      Measurement window: 5000 msec
      Using asynchronous calls for inference
      Stabilizing using average latency
    
    Request concurrency: 1
      Pass [1] throughput: 1546 infer/sec. Avg latency: 625 usec (std 37 usec)
      Pass [2] throughput: 1525.6 infer/sec. Avg latency: 632 usec (std 40 usec)
      Pass [3] throughput: 1512.8 infer/sec. Avg latency: 636 usec (std 55 usec)
      Client:
        Request count: 22922
        Sequence count: 1144 (76.2667 seq/sec)
        Throughput: 1528.13 infer/sec
        Avg latency: 631 usec (standard deviation 45 usec)
        p50 latency: 625 usec
        p90 latency: 653 usec
        p95 latency: 694 usec
        p99 latency: 825 usec
        Avg HTTP time: 596 usec (send 34 usec + response wait 562 usec + receive 0 usec)
      Server:
        Inference count: 27542
        Execution count: 27542
        Successful request count: 27542
        Avg request latency: 277 usec (overhead 57 usec + queue 58 usec + compute input 66 usec + compute infer 74 usec + compute output 21 usec)
    
    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 1528.13 infer/sec, latency 631 usec
    

    Verbose CSV

    Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
    1,1528.13,34,376,58,66,74,21,0,625,653,694,825,631,34,562
    

    Normal Model

    PA Output

    *** Measurement Settings ***
      Batch size: 1
      Using "time_windows" mode for stabilization
      Measurement window: 5000 msec
      Using synchronous calls for inference
      Stabilizing using average latency
    
    Request concurrency: 1
      Pass [1] throughput: 489.8 infer/sec. Avg latency: 2033 usec (std 206 usec)
      Pass [2] throughput: 486 infer/sec. Avg latency: 2050 usec (std 172 usec)
      Pass [3] throughput: 490.8 infer/sec. Avg latency: 2030 usec (std 200 usec)
      Client:
        Request count: 7333
        Throughput: 488.867 infer/sec
        Avg latency: 2038 usec (standard deviation 194 usec)
        p50 latency: 2090 usec
        p90 latency: 2236 usec
        p95 latency: 2279 usec
        p99 latency: 2342 usec
        Avg HTTP time: 2000 usec (send 142 usec + response wait 1857 usec + receive 1 usec)
      Server:
        Inference count: 8806
        Execution count: 8806
        Successful request count: 8806
        Avg request latency: 1395 usec (overhead 167 usec + queue 80 usec + compute input 157 usec + compute infer 785 usec + compute output 206 usec)
    
    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 488.867 infer/sec, latency 2038 usec
    

    Verbose CSV

    Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
    1,488.867,142,664,80,157,785,206,1,2090,2236,2279,2342,2038,143,1857
    

    Before

    Ensemble Model

    PA Output

    *** Measurement Settings ***
      Batch size: 1
      Using "time_windows" mode for stabilization
      Measurement window: 5000 msec
      Using synchronous calls for inference
      Stabilizing using average latency
    
    Request concurrency: 1
      Pass [1] throughput: 269.8 infer/sec. Avg latency: 3697 usec (std 230 usec)
      Pass [2] throughput: 266.8 infer/sec. Avg latency: 3739 usec (std 271 usec)
      Pass [3] throughput: 269.4 infer/sec. Avg latency: 3703 usec (std 267 usec)
      Client:
        Request count: 1347
        Throughput: 269.4 infer/sec
        Avg latency: 3703 usec (standard deviation 267 usec)
        p50 latency: 3740 usec
        p90 latency: 3986 usec
        p95 latency: 4005 usec
        p99 latency: 4070 usec
        Avg HTTP time: 3671 usec (send 141 usec + response wait 3529 usec + receive 1 usec)
      Server:
        Inference count: 1616
        Execution count: 1616
        Successful request count: 1616
        Avg request latency: 3075 usec (overhead 294 usec + queue 190 usec + compute 2591 usec)
    
      Composing models:
      add_sub_1, version:
          Inference count: 1616
          Execution count: 1616
          Successful request count: 1616
          Avg request latency: 1492 usec (overhead 166 usec + queue 88 usec + compute input 164 usec + compute infer 787 usec + compute output 286 usec)
    
      add_sub_2, version:
          Inference count: 1616
          Execution count: 1616
          Successful request count: 1616
          Avg request latency: 1622 usec (overhead 167 usec + queue 102 usec + compute input 164 usec + compute infer 815 usec + compute output 374 usec)
    
    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 269.4 infer/sec, latency 3703 usec
    

    Verbose CSV

    Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
    1,269.4,141,962,6,328,1603,660,1,3740,3986,4005,4070,3703,142,3529
    

    Sequence Model

    PA Output

    *** Measurement Settings ***
      Batch size: 1
      Using "time_windows" mode for stabilization
      Measurement window: 5000 msec
      Using asynchronous calls for inference
      Stabilizing using average latency
    
    Request concurrency: 1
      Pass [1] throughput: 1541.8 infer/sec. Avg latency: 626 usec (std 33 usec)
      Pass [2] throughput: 1515.2 infer/sec. Avg latency: 636 usec (std 51 usec)
      Pass [3] throughput: 1510.4 infer/sec. Avg latency: 638 usec (std 49 usec)
      Client:
        Request count: 7552
        Sequence count: 379 (75.8 seq/sec)
        Throughput: 1510.4 infer/sec
        Avg latency: 638 usec (standard deviation 49 usec)
        p50 latency: 628 usec
        p90 latency: 677 usec
        p95 latency: 710 usec
        p99 latency: 877 usec
        Avg HTTP time: 601 usec (send 36 usec + response wait 565 usec + receive 0 usec)
      Server:
        Inference count: 9071
        Execution count: 9071
        Successful request count: 9071
        Avg request latency: 278 usec (overhead 58 usec + queue 59 usec + compute input 65 usec + compute infer 74 usec + compute output 21 usec)
    
    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 1510.4 infer/sec, latency 638 usec
    

    Verbose CSV

    Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
    1,1510.4,36,380,59,65,74,21,0,628,677,710,877,638,36,565
    

    Normal Model

    PA Output

    *** Measurement Settings ***
      Batch size: 1
      Using "time_windows" mode for stabilization
      Measurement window: 5000 msec
      Using synchronous calls for inference
      Stabilizing using average latency
    
    Request concurrency: 1
      Pass [1] throughput: 492 infer/sec. Avg latency: 2024 usec (std 212 usec)
      Pass [2] throughput: 490.8 infer/sec. Avg latency: 2030 usec (std 174 usec)
      Pass [3] throughput: 495 infer/sec. Avg latency: 2012 usec (std 212 usec)
      Client:
        Request count: 2475
        Throughput: 495 infer/sec
        Avg latency: 2012 usec (standard deviation 212 usec)
        p50 latency: 2063 usec
        p90 latency: 2221 usec
        p95 latency: 2272 usec
        p99 latency: 2332 usec
        Avg HTTP time: 1978 usec (send 141 usec + response wait 1836 usec + receive 1 usec)
      Server:
        Inference count: 2969
        Execution count: 2969
        Successful request count: 2969
        Avg request latency: 1376 usec (overhead 165 usec + queue 79 usec + compute input 156 usec + compute infer 772 usec + compute output 204 usec)
    
    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 495 infer/sec, latency 2012 usec
    

    Verbose CSV

    Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
    1,495,141,656,79,156,772,204,1,2063,2221,2272,2332,2012,142,1836
    
    opened by Tabrizian 12
  • Add Python client asyncio_infer for GRPC

    This PR adds a simple asyncio method to the GRPC Python client. I didn't add examples or a corresponding HTTP method. Let me know if you want those.

    It also bumps up the version of the grpcio package. As of version 1.32.0, grpcio supports asyncio in its Python API. Increasing the version to 1.32.0 is the easiest way to implement this. If you don't want to bump the version, versions less than 1.32.0 can still use it via grpc.experimental.aio.

    Also, I'm not sold on the name asyncio_infer. The naming for inference would be a little confusing with infer, async_infer, asyncio_infer. Let me know what you think.

    enhancement 
    opened by tmccrmck 9
  • Support non-zero shm offset in tritonclient.utils.shared_memory

    [Bug] An exception was raised when using offset != 0 in set_shared_memory of InferInput of the HTTP client: https://github.com/triton-inference-server/server/issues/3986

    [Feature] Support offset for get_contents_as_numpy and set_shared_memory_region of the shm lib: https://github.com/triton-inference-server/server/issues/3987

    opened by remib-proovstation 8
  • Add count-based stabilization

    The count-based algorithm can be enabled by using the --window-size flag. I didn't reuse the -p flag because it would still require another flag to interpret the meaning of the -p flag.

    opened by Tabrizian 8
  • Make grpc_client thread-safe by creating ModelInferRequest on the fly.

    As has been noticed, grpc_client is not thread-safe. In this PR, I create a ModelInferRequest for every inference. The unexpected behavior was gone when calling from different threads. Can anyone help review this PR?

    opened by sijin-dm 6
  • Add MPI synchronization around measurements, keep each process inferencing until all are stable

    Adds MPI synchronization before and after the Profile() function call in main(). This ensures that load managers are all sending requests to the server on all MPI processes the entire time each MPI process is measuring performance.

    opened by matthewkotila 5
  • Remove CurlGlobal Race Condition

    Currently, the static CurlGlobal curl_global instance expects to be constructed after Error::Success, since the CurlGlobal constructor references Error::Success.

    This leads to a race condition on CentOS 7, as the two global objects sit in separate translation units, http_client.cc and common.cc. This PR merges a fix to make the CurlGlobal object a singleton, so it is always initialized before being used.

    opened by dyastremsky 5
  • Add bfloat to client utils

    Add bfloat16 library to client utils. Server PR to put bfloat16 pip library into SDK container: https://github.com/triton-inference-server/server/pull/4521

    opened by jbkyang-nvi 4
  • Fix deadlock in concurrency thread configuration

    There was a race condition where the thread creation and configuration (in ReconfigThreads) was timed in such a way that the worker thread got stuck waiting to be told what its concurrency should be. This was possible if the concurrency was updated right at (after?) this line: https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/concurrency_manager.cc#L343

    In that case the condition still evaluates to false, and the condition variable is not woken up by the change.

    The solution is to guard updates to the variable used in the condition_variable (concurrency_) behind the wake up mutex.

    opened by tgerdesnv 3
  • Python InferInput: set_data_from_numpy is not necessary in some cases.

    Objective

    In Python, InferInput only supports set_data_from_numpy. This means that users have to convert their data to a numpy array even when it is not necessary.

    So, I added an interface that can set raw_content without converting to a numpy array.

    Detail

    When a model in Triton server can process data with the bytes datatype, the user doesn't have to convert it to a numpy array.

    For example, if we use DALI as a preprocessor, DALI can process bytes data.

    opened by RRoundTable 4
  • Also build source distribution for Python client

    Closes https://github.com/triton-inference-server/server/issues/4661.

    With this PR, we're adding source distribution builds to the build script, which is necessary to add the package to conda-forge. The build_wheel.py script now creates a tritonclient*.tar.gz file in addition to the tritonclient*.whl. It would make sense to rename the build script, for example, to build_dist.py, but I wanted this change to be as small as possible.

    I don't know what's your release process but I would really appreciate a new PyPI release with the source distribution included, so I can continue with bringing tritonclient to conda-forge. Thanks a lot for your help! 🙏

    opened by janjagusch 12
  • Adding DtoD support to CUDA shared memory feature

    Hello world,

    This patch:

    • Adds DtoD CUDA memcopy support, which allows passing video frames decoded in CUDA memory to Triton.
    • Changes the API of the set_shared_memory_region function to accept CUDA memory in addition to numpy arrays.

    I'm not certain about the changes in the Python module API, so please feel free to give me guidance on how to implement it better.

    opened by rarzumanyan 7