
Overview


CTranslate2

CTranslate2 is a fast inference engine for OpenNMT-py and OpenNMT-tf models supporting both CPU and GPU execution. The goal is to provide comprehensive inference features and be the most efficient and cost-effective solution to deploy standard neural machine translation systems such as Transformer models.

The project is production-oriented and comes with backward compatibility guarantees, but it also includes experimental features related to model compression and inference acceleration.

Table of contents

  1. Key features
  2. Quickstart
  3. Installation
  4. Converting models
  5. Translating
  6. Environment variables
  7. Building
  8. Testing
  9. Benchmarks
  10. Frequently asked questions

Key features

  • Fast and efficient execution on CPU and GPU
    The execution is significantly faster and requires fewer resources than general-purpose deep learning frameworks on supported models and tasks.
  • Quantization and reduced precision
    The model serialization and computation support weights with reduced precision: 16-bit floating point numbers (FP16), 16-bit integers, and 8-bit integers.
  • Support for multiple CPU architectures
    The project supports x86-64 and ARM64 processors and integrates multiple backends that are optimized for these platforms: Intel MKL, oneDNN, OpenBLAS, and Apple Accelerate.
  • Automatic CPU detection and code dispatch
    One binary can include multiple backends (e.g. Intel MKL and oneDNN) and instruction set architectures (e.g. AVX, AVX2) that are automatically selected at runtime based on the CPU information.
  • Parallel and asynchronous translations
    Translations can be run efficiently in parallel and asynchronously using multiple GPUs or CPU cores.
  • Dynamic memory usage
    The memory usage changes dynamically depending on the request size while still meeting performance requirements thanks to caching allocators on both CPU and GPU.
  • Lightweight on disk
    Models can be quantized below 100MB with minimal accuracy loss. A full-featured Docker image supporting both CPU and GPU requires less than 400MB.
  • Simple integration
    The project has few dependencies and exposes translation APIs in Python and C++ to cover most integration needs.
  • Interactive decoding
    Advanced decoding features allow autocompleting a partial translation and returning alternatives at a specific location in the translation.

Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.

Supported decoding options

The translation API supports several decoding options:

  • decoding with greedy or beam search
  • random sampling from the output distribution
  • translating with a known target prefix
  • returning alternatives at a specific location in the target
  • constraining the decoding length
  • returning multiple translation hypotheses
  • returning attention vectors
  • approximating the generation using a pre-compiled vocabulary map
  • replacing unknown target tokens by source tokens with the highest attention

See the Decoding documentation for examples.
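
As an illustration, several of these options map directly to arguments of translate_batch in the Python API. The sketch below uses the model converted in the Quickstart below; argument names follow the Python API of this release line and may differ slightly between versions, and the values are only illustrative:

import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/")
translator.translate_batch(
    [["▁H", "ello", "▁world", "!"]],
    beam_size=4,                 # beam search (beam_size=1 is greedy decoding)
    num_hypotheses=2,            # return multiple translation hypotheses
    target_prefix=[["▁Hallo"]],  # translate with a known target prefix
    max_decoding_length=100,     # constrain the decoding length
    return_attention=True,       # return attention vectors
)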

Quickstart

The steps below assume a Linux OS and a Python installation (3.6 or above).

1. Install the Python package:

pip install --upgrade pip
pip install ctranslate2

2. Convert a model trained with OpenNMT-py or OpenNMT-tf, for example the pretrained Transformer model (choose one of the two models):

a. OpenNMT-py

pip install OpenNMT-py

wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
tar xf transformer-ende-wmt-pyOnmt.tar.gz

ct2-opennmt-py-converter --model_path averaged-10-epoch.pt --model_spec TransformerBase \
    --output_dir ende_ctranslate2

b. OpenNMT-tf

pip install OpenNMT-tf

wget https://s3.amazonaws.com/opennmt-models/averaged-ende-export500k-v2.tar.gz
tar xf averaged-ende-export500k-v2.tar.gz

ct2-opennmt-tf-converter --model_path averaged-ende-export500k-v2 --model_spec TransformerBase \
    --output_dir ende_ctranslate2

3. Translate tokenized inputs, for example with the Python API:

import ctranslate2
translator = ctranslate2.Translator("ende_ctranslate2/")
translator.translate_batch([["▁H", "ello", "▁world", "!"]])
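
The returned hypotheses can then be read back and detokenized. A minimal sketch, assuming the result format of this release line where each result is a list of hypothesis dictionaries:

results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])
best_tokens = results[0][0]["tokens"]                  # tokens of the best hypothesis
print("".join(best_tokens).replace("▁", " ").strip())  # e.g. "Hallo Welt!"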

Installation

Python package

Python packages are published on PyPI for Linux and macOS:

pip install ctranslate2

All software dependencies are included in the package, including CUDA libraries for GPU support on Linux. The macOS version only supports CPU execution.

Requirements:

  • OS: Linux, macOS
  • Python version: >= 3.6
  • pip version: >= 19.3
  • GPU driver version: >= 418.39

Docker images

The opennmt/ctranslate2 repository contains images for multiple Linux distributions, with or without GPU support:

docker pull opennmt/ctranslate2:latest-ubuntu18-cuda11.0

The images include:

  • a translation client to directly translate files
  • Python 3 packages
  • libctranslate2.so library development files

Manual compilation

See Building.

Converting models

The core CTranslate2 implementation is framework agnostic. The framework-specific logic is moved to a conversion step that serializes trained models into a simple binary format.

The following frameworks and models are currently supported:

  • OpenNMT-py: Transformer (Vaswani et al. 2017), including relative position representations (Shaw et al. 2018)
  • OpenNMT-tf: Transformer (Vaswani et al. 2017), including relative position representations (Shaw et al. 2018)

If you are using a model that is not listed above, consider opening an issue to discuss future integration.

Conversion scripts are part of the Python package and should be run in the same environment as the selected training framework:

  • ct2-opennmt-py-converter
  • ct2-opennmt-tf-converter

The converter Python API can also be used to convert Transformer models with any number of layers, hidden dimensions, and attention heads.
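
For example, a custom Transformer variant can be converted programmatically. The sketch below assumes the converter API of this release line (ctranslate2.converters.OpenNMTPyConverter is listed in the backward compatibility guarantees below; the TransformerSpec arguments are an assumption and only illustrative):

import ctranslate2

converter = ctranslate2.converters.OpenNMTPyConverter("averaged-10-epoch.pt")
model_spec = ctranslate2.specs.TransformerSpec(6, 8)  # 6 layers, 8 attention heads
converter.convert("ende_ctranslate2", model_spec, quantization="int8")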

Integrated model conversion

Models can also be converted directly from the supported training frameworks; see their respective documentation.

Quantization and reduced precision

The converters support reducing the precision of the weights to save space and possibly accelerate the model execution. The --quantization option accepts the following values:

  • int8
  • int16
  • float16

When loading a quantized model, the library tries to use the same type for computation. If the current platform or backend does not support optimized execution for this computation type (e.g. int16 is not optimized on GPU), then the library converts the model weights to another optimized type. The tables below document the fallback types:

On CPU:

CPU vendor   int8 model   int16 model   float16 model
Intel        int8         int16         float
other        int8         int8          float

(This table only applies for prebuilt binaries or when compiling with both Intel MKL and oneDNN backends.)

On GPU:

Compute Capability   int8 model   int16 model   float16 model
>= 7.0               int8         float16       float16
6.1                  int8         float         float
<= 6.0               float        float         float

Notes:

  • The computation type can also be changed when creating a translation instance by setting the --compute_type argument.
  • Integer quantization is only applied for GEMM-based layers and embeddings.
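
As an illustration, the Quickstart model can be converted with 8-bit weights and then loaded with an explicit computation type. This is only a sketch: the output directory name ende_ctranslate2_int8 is an example, and compute_type is the Python counterpart of the --compute_type argument mentioned above:

ct2-opennmt-py-converter --model_path averaged-10-epoch.pt --model_spec TransformerBase \
    --quantization int8 --output_dir ende_ctranslate2_int8

import ctranslate2
translator = ctranslate2.Translator("ende_ctranslate2_int8/", compute_type="int8")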

Adding converters

Each converter should populate a model specification with trained weights coming from an existing model. The model specification declares the variable names and layout expected by the CTranslate2 core engine.

See the existing converter implementations, which can be used as templates.

Translating

The examples use the English-German model converted in the Quickstart. This model requires a SentencePiece tokenization.
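
In practice, raw text must first be tokenized with the SentencePiece model distributed with the pretrained checkpoint. A minimal sketch assuming the sentencepiece Python package and a sentencepiece.model file extracted next to the converted model (the file name is an assumption):

import sentencepiece as spm
import ctranslate2

sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")  # tokenizer shipped with the pretrained model (assumed path)

translator = ctranslate2.Translator("ende_ctranslate2/")

tokens = sp.encode_as_pieces("Hello world!")  # e.g. ["▁H", "ello", "▁world", "!"]
results = translator.translate_batch([tokens])
print(sp.decode_pieces(results[0][0]["tokens"]))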

With the translation client

echo "▁H ello ▁world !" | docker run --gpus=all -i --rm -v $PWD:/data \
    opennmt/ctranslate2:latest-ubuntu18-cuda11.0 --model /data/ende_ctranslate2 --device cuda

See docker run --rm opennmt/ctranslate2:latest-ubuntu18-cuda11.0 --help for additional options.

With the Python API

import ctranslate2
translator = ctranslate2.Translator("ende_ctranslate2/")
translator.translate_batch([["▁H", "ello", "▁world", "!"]])

See the Python reference for more advanced usages.

With the C++ API

#include <iostream>
#include <ctranslate2/translator.h>

int main() {
  ctranslate2::Translator translator("ende_ctranslate2/");
  ctranslate2::TranslationResult result = translator.translate({"▁H", "ello", "▁world", "!"});

  for (const auto& token : result.output())
    std::cout << token << ' ';
  std::cout << std::endl;
  return 0;
}

See the Translator class for more advanced usages, and the TranslatorPool class for running translations in parallel.

Environment variables

Some environment variables can be configured to customize the execution:

  • CT2_CUDA_ALLOCATOR: Select the CUDA memory allocator. Possible values are: cub_caching (default), cuda_malloc_async (requires CUDA >= 11.2).
  • CT2_CUDA_ALLOW_FP16: Allow using FP16 computation on GPU even if the device does not have efficient FP16 support.
  • CT2_CUDA_CACHING_ALLOCATOR_CONFIG: Tune the CUDA caching allocator (see Performance).
  • CT2_FORCE_CPU_ISA: Force CTranslate2 to select a specific instruction set architecture (ISA). Possible values are: GENERIC, AVX, AVX2. Note: this does not impact backend libraries (such as Intel MKL) which usually have their own environment variables to configure ISA dispatching.
  • CT2_TRANSLATORS_CORE_OFFSET: If set to a non-negative value, parallel translators are pinned to cores in the range [offset, offset + inter_threads]. Requires compilation with -DOPENMP_RUNTIME=NONE.
  • CT2_USE_EXPERIMENTAL_PACKED_GEMM: Enable the packed GEMM API for Intel MKL (see Performance).
  • CT2_USE_MKL: Force CTranslate2 to use (or not) Intel MKL. By default, the runtime automatically decides whether to use Intel MKL or not based on the CPU vendor.
  • CT2_VERBOSE: Enable some verbose logs to help debugging the run configuration.
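
For example, the variables can be set for a single run of the translation client built in the Building section (the values below are only illustrative):

CT2_VERBOSE=1 CT2_FORCE_CPU_ISA=AVX2 ./cli/translate --model ende_ctranslate2/ --device cpu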

Building

Docker images

The Docker images build all translation clients presented in Translating. The build command should be run from the project root directory, e.g.:

docker build -t opennmt/ctranslate2:latest-ubuntu18 -f docker/Dockerfile.ubuntu .

When building GPU images, the CUDA version can be selected with --build-arg CUDA_VERSION=11.0.

See the docker/ directory for available images.

Build options

The project uses CMake for compilation. The following options can be set with -DOPTION=VALUE:

CMake option Accepted values (default in bold) Description
BUILD_TESTS OFF, ON Compiles the tests
CMAKE_CXX_FLAGS compiler flags Defines additional compiler flags
ENABLE_CPU_DISPATCH OFF, ON Compiles CPU kernels for multiple ISA and dispatches at runtime (should be disabled when explicitly targeting an architecture with the -march compilation flag)
ENABLE_PROFILING OFF, ON Enables the integrated profiler (usually disabled in production builds)
LIB_ONLY OFF, ON Disables the translation client
OPENMP_RUNTIME INTEL, COMP, NONE Selects or disables the OpenMP runtime (INTEL: Intel OpenMP; COMP: OpenMP runtime provided by the compiler; NONE: no OpenMP runtime)
WITH_CUDA OFF, ON Compiles with the CUDA backend
WITH_DNNL OFF, ON Compiles with the oneDNN backend (a.k.a. DNNL)
WITH_MKL OFF, ON Compiles with the Intel MKL backend
WITH_ACCELERATE OFF, ON Compiles with the Apple Accelerate backend
WITH_OPENBLAS OFF, ON Compiles with the OpenBLAS backend

Some build options require external dependencies:

  • -DWITH_MKL=ON requires Intel MKL
  • -DWITH_DNNL=ON requires oneDNN (DNNL)
  • -DWITH_ACCELERATE=ON requires the Apple Accelerate framework (macOS only)
  • -DWITH_OPENBLAS=ON requires OpenBLAS
  • -DWITH_CUDA=ON requires the CUDA toolkit

Multiple backends can be enabled for a single build. When building with both Intel MKL and oneDNN, the backend will be selected at runtime based on the CPU information.

Example (Ubuntu)

Install Intel MKL (optional for GPU-only builds)

Use the following instructions to install Intel MKL:

wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo sh -c 'echo "deb https://apt.repos.intel.com/oneapi all main" > /etc/apt/sources.list.d/oneAPI.list'
sudo apt-get update
sudo apt-get install intel-oneapi-mkl-devel

See the Intel MKL documentation for other installation methods.

Install CUDA (optional for CPU-only builds)

See the NVIDIA documentation for information on how to download and install CUDA.

Compile

Under the project root, run the following commands:

git submodule update --init
mkdir build && cd build
cmake -DWITH_MKL=ON -DWITH_CUDA=ON ..
make -j4

(If you did not install one of Intel MKL or CUDA, set its corresponding flag to OFF in the CMake command line.)

These steps should produce the cli/translate binary. You can try it with the model converted in the Quickstart section:

$ echo "▁H ello ▁world !" | ./cli/translate --model ende_ctranslate2/ --device auto
▁Hallo ▁Welt !

Testing

C++

To enable the tests, you should configure the project with cmake -DBUILD_TESTS=ON. The binary tests/ctranslate2_test runs all tests using Google Test. It expects the path to the test data as argument:

./tests/ctranslate2_test ../tests/data

Python

# Install the CTranslate2 library.
cd build && make install && cd ..

# Build and install the Python wheel.
cd python
pip install -r install_requirements.txt
python setup.py bdist_wheel
pip install dist/*.whl

# Run the tests with pytest.
pip install -r tests/requirements.txt
pytest tests/test.py

Depending on your build configuration, you might need to set LD_LIBRARY_PATH if missing libraries are reported when running tests/test.py.
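
For example, assuming the default CMake install prefix /usr/local, the library location can be added before running the tests:

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH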

Benchmarks

We compare CTranslate2 with OpenNMT-py and OpenNMT-tf on their pretrained English-German Transformer models (available on the website). For this benchmark, the CTranslate2 models use the weights of the OpenNMT-py model.

Model size

Model         Size
OpenNMT-py    542MB
OpenNMT-tf    367MB
CTranslate2   364MB
- int16       187MB
- float16     182MB
- int8        100MB

CTranslate2 models are generally lighter and can go as low as 100MB when quantized to int8. This also results in faster loading times and noticeably lower memory usage at runtime.

Results

We translate the test set newstest2014 and report:

  • the number of target tokens generated per second (higher is better)
  • the maximum memory usage (lower is better)
  • the BLEU score of the detokenized output (higher is better)

Translations run beam search with a beam size of 4 and a maximum batch size of 32.

See the directory tools/benchmark for more details about the benchmark procedure and how to run it. Also see the Performance document to further improve CTranslate2 performance.

Please note that the results presented below are only valid for the configuration used during this benchmark: absolute and relative performance may change with different settings.

CPU

Tokens per second Max. memory BLEU
OpenNMT-tf 2.14.0 (with TensorFlow 2.4.0) 279.3 2308MB 26.93
OpenNMT-py 2.0.0 (with PyTorch 1.7.0) 292.9 1840MB 26.77
- int8 383.3 1784MB 26.86
CTranslate2 1.17.0 593.2 970MB 26.77
- int16 777.2 718MB 26.84
- int8 921.5 635MB 26.92
- int8 + vmap 1143.4 621MB 26.75

Executed with 4 threads on a c5.metal Amazon EC2 instance equipped with an Intel(R) Xeon(R) Platinum 8275CL CPU.

GPU

Tokens per second Max. GPU memory Max. CPU memory BLEU
OpenNMT-tf 2.14.0 (with TensorFlow 2.4.0) 1753.4 4958MB 2525MB 26.93
OpenNMT-py 2.0.0 (with PyTorch 1.7.0) 1189.4 2838MB 2666MB 26.77
CTranslate2 1.17.0 2721.1 1164MB 954MB 26.77
- int8 3710.0 882MB 541MB 26.86
- float16 3965.8 924MB 590MB 26.75
- float16 + local sorting 4869.4 1148MB 591MB 26.75

Executed with CUDA 11.0 on a g4dn.xlarge Amazon EC2 instance equipped with a NVIDIA T4 GPU (driver version: 450.80.02).

Frequently asked questions

How does it relate to the original CTranslate project?

The original CTranslate project shares a similar goal which is to provide a custom execution engine for OpenNMT models that is lightweight and fast. However, it has some limitations that were hard to overcome:

  • a strong dependency on LuaTorch and OpenNMT-lua, which are now both deprecated in favor of other toolkits;
  • a direct reliance on Eigen, which introduces heavy templating and limited GPU support.

CTranslate2 addresses these issues in several ways:

  • the core implementation is framework agnostic, moving the framework specific logic to a model conversion step;
  • the internal operators follow the ONNX specifications as much as possible for better future-proofing;
  • calls to external libraries (Intel MKL, cuBLAS, etc.) occur as late as possible in the execution so that the core logic does not depend on library-specific behavior.

What is the state of this project?

The implementation has been extensively tested in production environments, so you can rely on it in your applications. The project versioning follows Semantic Versioning 2.0.0. The following APIs are covered by backward compatibility guarantees:

  • Converted models
  • Python converters options
  • Python symbols:
    • ctranslate2.Translator
    • ctranslate2.converters.OpenNMTPyConverter
    • ctranslate2.converters.OpenNMTTFConverter
  • C++ symbols:
    • ctranslate2::models::Model
    • ctranslate2::TranslationOptions
    • ctranslate2::TranslationResult
    • ctranslate2::Translator
    • ctranslate2::TranslatorPool
  • C++ translation client options

Other APIs are expected to evolve to increase efficiency, genericity, and model support.

Why and when should I use this implementation instead of PyTorch or TensorFlow?

Here are some scenarios where this project could be used:

  • You want to accelerate standard translation models for production usage, especially on CPUs.
  • You need to embed translation models in an existing C++ application without adding large dependencies.
  • Your application requires custom threading and memory usage control.
  • You want to reduce the model size on disk and/or memory.

However, you should probably not use this project when:

  • You want to train custom architectures not covered by this project.
  • You see no value in the key features listed at the top of this document.

What hardware is supported?

CPU

CTranslate2 supports x86-64 and ARM64 processors. It includes optimizations for AVX, AVX2, and NEON and supports multiple BLAS backends that should be selected based on the target platform (see Building).

Prebuilt binaries are designed to run on any x86-64 processor supporting at least SSE 4.2. The binaries implement runtime dispatch to select the best backend and instruction set architecture (ISA) for the platform. In particular, they are compiled with both Intel MKL and oneDNN so that Intel MKL is only used on Intel processors where it performs best, whereas oneDNN is used on other x86-64 processors such as AMD.

GPU

CTranslate2 supports NVIDIA GPUs with a Compute Capability greater than or equal to 3.0 (Kepler). FP16 execution requires a Compute Capability greater than or equal to 7.0.

The driver requirement depends on the CUDA version. See the CUDA Compatibility guide for more information.

What are the known limitations?

The current approach only exports the weights from existing models and redefines the computation graph in code. This implies strong assumptions about the graph architecture executed by the original framework.

We are actively looking into easing this assumption by supporting ONNX graphs as model parts.

What are the future plans?

There are many ways to make this project better and even faster. See the open issues for an overview of current and planned features. Here are some things we would like to get to:

  • Support of running ONNX graphs

What is the difference between intra_threads and inter_threads?

  • intra_threads is the number of OpenMP threads used per translation: increase this value to decrease the latency.
  • inter_threads is the maximum number of CPU translations executed in parallel: increase this value to increase the throughput. Even though the model data is shared, this execution mode increases the memory usage as some internal buffers are duplicated for thread safety.

The total number of computing threads launched by the process is summarized by this formula:

num_threads = inter_threads * intra_threads

Note that these options are only defined for CPU translation and are forced to 1 when executing on GPU. Parallel translations on GPU require multiple GPUs. See the option device_index that accepts multiple device IDs.
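
Both options are set when creating the translator. A minimal sketch with the Python API (the thread counts below are only illustrative):

import ctranslate2

# 2 parallel CPU translators, each running 4 OpenMP threads
# (total computing threads: 2 * 4 = 8).
translator = ctranslate2.Translator(
    "ende_ctranslate2/",
    device="cpu",
    inter_threads=2,
    intra_threads=4,
)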

Do you provide a translation server?

The OpenNMT-py REST server is able to serve CTranslate2 models. See the code integration to learn more.

How do I generate a vocabulary mapping file?

See here.

Comments
  • How to install Ctranslate2 without Docker

    How to install Ctranslate2 without Docker

    Hi, I'd like to install the CTranslate2 module without using Docker. Is it possible? Are there any scripts for this? I've tried generating a shell script from the Dockerfile but it gives me some errors. Thanks

    question 
    opened by anderleich 24
  • Question about benchmark report (batch size impact)

    Question about benchmark report (batch size impact)

    Since batch_size is set to 32, does that mean the tokens per second figure includes the output of a batch of 32? Would it be different with a batch size of 1 (common in production)? If it is different, do we have any results for that? Thanks a lot.

    question 
    opened by rossbucky 20
  • Plans to support model trained in fairseq

    Plans to support model trained in fairseq

    Can you please support models trained in fairseq? Otherwise, since it is torch, can it be imported for inference and quantization?

    Also, are the model sizes for transformer_big? If it is transformer_base it would be around half of that. Please consider distilling the model into a smaller model, which would help for inference and size.

    help wanted 
    opened by gvskalyan 18
  • Convert m2m100 with several languages.

    Convert m2m100 with several languages.

    Hello, is it possible to convert the m2m100 model with several output languages? Indeed, you have to specify the input and output languages in the converter.

    Or is it possible to have a sample script that shows how to convert m2m100? I am new to this library.

    question 
    opened by Jourdelune 17
  • Translations differ

    Translations differ

    Hi,

    I've recently realized that my converted OpenNMT-py model is not returning the same translation for a sentence when compared to the original OpenNMT-py model. Model architecture is the default base Transformer. I'm using Ctranslate2 version 1.18.1.

    It seems Ctranslate2 model is merging different hypotheses into the final result, thus, inserting some word repetitions and synonyms in the translation.

    Example result:

    OpenNMT-py: 2013ko abuztutik aurrera, ikertzaile txinatarren zenbait taldek instalazioetan egindako CRISPR bidezko lehen edizio arrakastatsuak dokumentatu zituzten.

    Ctranslate2: 2013ko abuztutik aurrera, Txinako zenbait ikertzailbatzuek instalazioetan egindako CRISPR bidezko lehen edizio genetiko arrakastatsuak dokumentatu zituzten.

    In the second case it should be either zenbait ikertzailek or ikertzaile batzuek, but it is combining both of them, even truncating some words.

    This is the configuration file for the server:

    {
        "models_root": "/absolute/path/to/model/dir/",
        "models": [
            {
                "id": 100,
                "ct2_model": "BEST_MODEL.pt_ctrans",
                "model": "BEST_MODEL.pt",
                "timeout": 600,
                "on_timeout": "to_cpu",
                "load": true,
                "opt": {
                    "gpu": 0,
                    "batch_size": 64,
                    "beam_size": 5,
                    "max_length": 200
                },
                "tokenizer": {
                    "type": "pyonmttok",
                    "mode": "conservative",
                    "params": {
                        "bpe_model_path": "/absolute/path/codes.bpe",
                        "joiner": "\uffed",
                        "joiner_annotate": true,
                        "case_markup": true
                    }
                }
            }
        ]
    }
    

    Is this a known issue?

    Thanks

    bug 
    opened by anderleich 17
  • Support ARM 64-bit architecture

    Support ARM 64-bit architecture

    Closes #221

    • Add OpenBLAS and Apple Accelerate GEMM backends (DNNL backend also works on ARM)
    • Add vectorized kernels using NEON
    • Add CI pipelines for aarch64 using GCC cross compiler and QEMU user mode emulation
    opened by keichi 17
  • 8-bit translation raises error CUBLAS_STATUS_NOT_SUPPORTED on Ampere GPU

    8-bit translation raises error CUBLAS_STATUS_NOT_SUPPORTED on Ampere GPU

    Hi, I installed ctranslate2 with pip. When I run translator.translate_batch(), I get the error: RuntimeError: cuBLAS failed with status CUBLAS_STATUS_NOT_SUPPORTED. Then, when compiling with CUDA from source, I get this error:

    [  1%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/ops/ctranslate2_generated_quantize_gpu.cu.o
    nvcc fatal   : Unsupported gpu architecture 'compute_86'
    CMake Error at ctranslate2_generated_quantize_gpu.cu.o.Release.cmake:220 (message):
      Error generating
      /host/mercube/running/CTranslate2/build/CMakeFiles/ctranslate2.dir/src/ops/./ctranslate2_generated_quantize_gpu.cu.o
    
    
    make[2]: *** [CMakeFiles/ctranslate2.dir/build.make:121: CMakeFiles/ctranslate2.dir/src/ops/ctranslate2_generated_quantize_gpu.cu.o] Error 1
    make[1]: *** [CMakeFiles/Makefile2:95: CMakeFiles/ctranslate2.dir/all] Error 2
    make: *** [Makefile:130: all] Error 2
    

    The GPU is an RTX 3060 and the CUDA version is 11.0. I've run the PyTorch model and it worked. Does CTranslate2 support the RTX 30 series?

    bug gpu 
    opened by beichao1314 15
  • M2M100 wrong Translation

    M2M100 wrong Translation

    Hello, I have a bug with m2m100 and ctranslate2. When I try to translate this text from English to French, this is what I get: Oh, ICQ! that's when dinosaurs were? Result: That's it, that's it, that's it, that's it, that's it, that's it, that's it, that's it, that's it, that's it.

    I report it as due to ctranslate2 because with the same model in hugging face I don't have this problem.

    opened by Jourdelune 14
  • Question about compute_type

    Question about compute_type

    I found that when I set the compute_type to "float16" or "float", the latency is larger than with "default".

    I was using V100, Compute Capability should be ok.

    translator = ctranslate2.Translator(
        model_path: str                 # Path to the CTranslate2 model directory.
        device: str = "cuda",            # The device to use: "cpu", "cuda", or "auto".
        device_index: int = 0,          # The index of the device to place this translator on.
        compute_type: str = "default"   # The computation type: "default", "int8", "int16", "float16", or "float",
                                        # or a dict mapping a device to a computation type.
    
    

    does this mean that if I use float16 instead of "default", it's slow due to I actually convert some int operation to float16 operation?

        default: {
          // By default, the compute type is the type of the saved model weights.
          // To ensure that any models can be loaded, we enable the fallback.
    
          ComputeType inferred_compute_type = ComputeType::FLOAT;
          switch (weights_type) {
          case DataType::INT16:
            inferred_compute_type = ComputeType::INT16;
            break;
          case DataType::INT8:
            inferred_compute_type = ComputeType::INT8;
            break;
          case DataType::FLOAT16:
            inferred_compute_type = ComputeType::FLOAT16;
            break;
          default:
            inferred_compute_type = ComputeType::FLOAT;
            break;
          }
    
    question 
    opened by WangYongzhao 14
  • compiled client doesn't work as expected in Windows

    compiled client doesn't work as expected in Windows

    So I managed to compile everything with MSVC but I can't figure out why the client doesn't translate as expected. With short sentences containing only a few words (~10), it seems to be working fine. With longer sentences, I get very short, truncated, and irrelevant translations or just a single irrelevant word. Under OS X, it works wonderfully, no matter the length of the sentence. In both systems I'm using the same converted tf model and the same sentencepiece model. The only weird thing I can notice is that the special underscore character from sentencepiece in shared_vocabulary.txt has encoding issues under Windows and appears as an empty box.

    bug 
    opened by panosk 14
  • Docker and import Python

    Docker and import Python

    Hi,

    If I want to use OpenNMT server with ctranslate2 how can I install Ctranslate2 in order to work? It should be in the virtual environment, shouldn't it? Is it possible to install it with the docker or do I need to install it manually?

    Thanks

    opened by anderleich 13
  • Comparisons to TensorFlow and pytorch impls make less sense than comparisons to onnxruntime impls

    Comparisons to TensorFlow and pytorch impls make less sense than comparisons to onnxruntime impls

    In my experience (though, for other models, CNN and visual transformer ones) onnxruntime provides better performance for inference on CPU than pytorch, tensorflow and (surprisingly) Apache TVM (CPU and Vulkan GPU backends, the models were "optimized"). So it may make little sense to compare CTranslate2 (which is targeted at fast inference) to frameworks targeted at convenient model design and training.

    enhancement 
    opened by KOLANICH 4
  • Support encoder-only models like BERT

    Support encoder-only models like BERT

    We should look into supporting encoder-only models like BERT via a new high-level class ctranslate2.Encoder.

    BERT models are usually followed by various prediction heads depending on the task. We can't support all these tasks so the scope of the development should be restricted to the Transformer model itself. The module should return the last hidden state and/or the pooled output and the predictions heads should be run separately with the initial framework (e.g. PyTorch).

    enhancement 
    opened by guillaumekln 2
  • The number of tokens in a batch may exceed max_batch_size when batch_type is "tokens"

    The number of tokens in a batch may exceed max_batch_size when batch_type is "tokens".

    Hello!

    Using the score_batch feature with batch_type=tokens, I found that the get_batch_size_increment function increments the batch size by the length of each example, rather than by the length of the longest example in the given sentences. This may produce a batch with more tokens than max_batch_size. The actual size may be max_batch_size + the number of padding tokens.

    https://github.com/OpenNMT/CTranslate2/blob/0455e1fe1e4a57e00d13d1b13ad44a61545ccfe9/src/batch_reader.cc#L22

    As a small experiment, I compared the input file sorted by sentence length in reverse order with the unsorted original. And I found that the process ended successfully with the one sorted, even though the original unsorted one produced an out-of-memory error using the same max_batch_size. I think it happened because the actual number of tokens in the batch was so large because the unsorted input file had more padding tokens.

    enhancement 
    opened by TomokiMatsuno 1
  • Support BART models for classification

    Support BART models for classification

    Hello again,

    I'm trying to convert this adaptation of Bart Large MNLI: https://huggingface.co/joeddav/bart-large-mnli-yahoo-answers

    It returns the following error (but the base Bart Large MNLI model works well):

    Traceback (most recent call last):
      File "/home/ubuntu/.local/bin/./ct2-transformers-converter", line 8, in <module>
        sys.exit(main())
      File "/home/ubuntu/.local/lib/python3.10/site-packages/ctranslate2/converters/transformers.py", line 445, in main
        converter.convert_from_args(args)
      File "/home/ubuntu/.local/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 50, in convert_from_args
        return self.convert(
      File "/home/ubuntu/.local/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 89, in convert
        model_spec = self._load()
      File "/home/ubuntu/.local/lib/python3.10/site-packages/ctranslate2/converters/transformers.py", line 62, in _load
        return loader(self._model_name_or_path)
      File "/home/ubuntu/.local/lib/python3.10/site-packages/ctranslate2/converters/transformers.py", line 85, in __call__
        spec = self.get_model_spec(model)
      File "/home/ubuntu/.local/lib/python3.10/site-packages/ctranslate2/converters/transformers.py", line 146, in get_model_spec
        pre_norm=model.config.normalize_before,
      File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 257, in __getattribute__
        return super().__getattribute__(key)
    AttributeError: 'BartConfig' object has no attribute 'normalize_before'
    

    Thanks in advance!

    enhancement 
    opened by juliensalinas 5
Releases(v3.2.0)
  • v3.2.0(Dec 12, 2022)

    New features

    • Add decoding option suppress_sequences to prevent specific sequences of tokens from being generated
    • Add decoding option end_token to stop the decoding on a different token than the model EOS token
    • Allow returning multiple random hypotheses from greedy search + random sampling when setting num_hypotheses > 1

    Fixes and improvements

    • Improve support for batch generation with the Whisper model:
      • Improve performance of batch generation with a context (we only require the prompts to have the same length, which is easily done by adapting the number of previous text tokens)
      • Support batch mode for option return_no_speech_prob
      • Support cases where some prompts in the batch have the token <|notimestamps|> but not others
    • Enable the Conv1D layer in more Python wheels:
      • macOS x64 (using oneDNN)
      • macOS ARM64 (using a custom implementation)
      • Linux AArch64 (using a custom implementation)
    • Update the OpenNMT-py converter to support the latest checkpoint structure
    • Generalize the TransformerSpec constructor to accept arbitrary encoder and decoder specifications
    • Remove the global compilation flag -ffast-math which introduces unwanted side effects and enable it only for the layer norm CPU kernel where it is actually useful
    • Fix CMake error on Windows when setting -DOPENMP_RUNTIME=COMP
  • v3.1.0(Nov 29, 2022)

    Changes

    • The input prompt is no longer included in the result of Whisper.generate as it is usually not useful in a transcription loop
    • The default beam size in Whisper.generate is updated from 1 to 5 to match the default value in openai/whisper
    • Generation options min_length and no_repeat_ngram_size now penalize the logits instead of the log probs which may change some scores
    • Raise a deprecation warning when reading the TranslationResult object as a list of dictionaries

    New features

    • Allow configuring the C++ logs from Python with the function ctranslate2.set_log_level
    • Implement the timestamp decoding rules when the Whisper prompt does not include the token <|notimestamps|>
    • Add option return_no_speech_prob to the method Whisper.generate for the result to include the probability of the no speech token

    Fixes and improvements

    • Improve performance of the Whisper model when generating with a context
    • Fix timestamp tokens in the Whisper vocabulary to use the correct format (<|X.XX|>)
    • Fix AVX and NEON log functions to return -inf on log(0) instead of NaN
    • When info logs are enabled, log the system configuration only when the first model is loaded and not immediately when the library is loaded
    • Define a LogitsProcessor abstract class to apply arbitrary updates to the logits during decoding
    • Update oneDNN to 2.7.2
  • v3.0.2(Nov 14, 2022)

  • v3.0.1(Nov 10, 2022)

  • v3.0.0(Nov 7, 2022)

    This major version integrates the Whisper speech recognition model published by OpenAI. It also introduces some breaking changes to remove deprecated usages and simplify some modules.

    Breaking changes

    General

    • Remove option normalize_scores: the scores are now always divided by pow(length, length_penalty) with length_penalty defaulting to 1
    • Remove option allow_early_exit: the beam search now exits early only when no penalties are used

    Python

    • Rename some classes:
      • OpenNMTTFConverterV2 -> OpenNMTTFConverter
      • TranslationStats -> ExecutionStats
    • Remove compatibility for reading ScoringResult as a list of scores: the scores can be accessed with the attribute log_probs
    • Remove compatibility for reading ExecutionStats as a tuple
    • Remove support for deprecated Python version 3.6

    CLI

    • Rename the client executable translate to a more specific name ct2-translator

    C++

    • Rename or remove some classes and methods:
      • TranslationStats -> ExecutionStats
      • GeneratorPool -> Generator
      • TranslatorPool -> Translator
      • TranslatorPool::consume_* -> Translator::translate_*
      • TranslatorPool::consume_stream -> removed
      • TranslatorPool::score_stream -> removed
    • Remove support for building with CUDA 10

    New features

    • Integrate the Whisper speech recognition model published by OpenAI
    • Support conversion of models trained with OpenNMT-py V3
    • Add method Generator.forward_batch to get the full model output for a batch of sequences
    • Add Python class StorageView to expose C++ methods taking or returning N-dimensional arrays: the class implements the array interface for interoperability with Numpy and PyTorch
    • Add a new configuration file config.json in the model directory that contains non-structural model parameters (e.g. related to the input, the vocabulary, etc.)
    • Implement the Conv1D layer and operator on CPU and GPU (using oneDNN and cuDNN respectively)
    • [C++] Allow registration of external models with models::ModelFactory

    Fixes and improvements

    • Fix conversion of models that use biases only for some QKV projections but not for all
    • Fuse masking of the output log probs by aggregating disabled tokens from all related options: disable_unk, min_length, no_repeat_ngram_size, etc.
    • Reduce the layer norm epsilon value on GPU to 1e-5 to match the default value in PyTorch
    • Move some Transformer model attributes under the encoder/decoder scopes to simplify loading
    • Redesign the ReplicaPool base class to simplify adding new classes with multiple model workers
    • Compile the library with C++17
    • Update oneDNN to 2.7.1
    • Update oneMKL to 2022.2
    • Update pybind11 to 2.10.1
    • Update cibuildwheel to 2.11.2
  • v2.24.0(Oct 3, 2022)

    Changes

    • The Linux binaries now use the GNU OpenMP runtime instead of Intel OpenMP to work around an initialization error on systems without /dev/shm

    Fixes and improvements

    • Fix a memory error when running random sampling on GPU
    • Optimize the model loading on multiple GPUs by copying the finalized model weights instead of reading the model from disk multiple times
    • In the methods Translator.translate_iterable and Translator.score_iterable, raise an error if the input iterables don't have the same length
    • Fix some compilation warnings
  • v2.23.0(Sep 16, 2022)

    New features

    • Build wheels for Python 3.11

    Fixes and improvements

    • In beam search, get more candidates from the model output and replace finished hypotheses by these additional candidates
    • Fix possibly incorrect attention vectors returned from the beam search
    • Fix coverage penalty that was actually not applied
    • Fix crash when the beam size is larger than the vocabulary size
    • Add missing compilation flag -fvisibility=hidden when building the Python module
    • Update oneDNN to 2.6.2
    • Update OpenBLAS to 0.3.21
  • v2.22.0(Sep 2, 2022)

    Changes

    • score_batch methods now return a list of ScoringResult instances instead of plain lists of probabilities. In most cases you should not need to update your code: the result object implements the methods __len__, __iter__, and __getitem__ so that it can still be used as a list.

    New features

    • Add methods to efficiently process long iterables:
      • Translator.translate_iterable
      • Translator.score_iterable
      • Generator.generate_iterable
      • Generator.score_iterable
    • Add decoding option min_alternative_expansion_prob to filter out unlikely alternatives in return_alternatives mode
    • Return ScoringResult instances from score_batch to include additional outputs. The current attributes are:
      • tokens: the list of tokens that were actually scored (including special tokens)
      • log_probs: the log probability of each scored token
    • Support running score_batch asynchronously by setting the asynchronous flag

    Fixes and improvements

    • Fix possibly incorrect results when using disable_unk or use_vmap with one of the following options:
      • min_decoding_length
      • no_repeat_ngram_size
      • prefix_bias_beta
      • repetition_penalty
    • Also pad the output layer during scoring to enable Tensor Cores
    • Improve the correctness of the model output probabilities when the output layer is padded
    • Skip translation when the NLLB input is empty (i.e. when the input only contains EOS and the language token)
  • v2.21.1(Jul 29, 2022)

  • v2.21.0(Jul 27, 2022)

    New features

    • Support NLLB multilingual models via the Transformers converter
    • Support Pegasus summarization models via the Transformers converter

    Fixes and improvements

    • Do not stop decoding when the EOS token is coming from the user input: this is required by some text generation models like microsoft/DialoGPT where EOS is used as a separator
    • Fix conversion error for language models trained with OpenNMT-py
    • Fix conversion of models that are not using bias terms in the multi-head attention
    • Fix data type error when enabling the translation options return_alternatives and return_attention with a float16 model
    • Improve CPU performance of language models quantized to int8
    • Implement a new vectorized GELU operator on CPU
    • Raise a more explicit error when trying to convert an unsupported Fairseq model
    • Update pybind11 to 2.10.0
  • v2.20.0(Jul 6, 2022)

    New features

    • Generation option no_repeat_ngram_size to prevent the repetitions of N-grams with a minimum size

    Fixes and improvements

    • Fix conversion of OpenNMT-tf models that use static position embeddings
    • Fix a segmentation fault in return_alternatives mode when the target prefix is longer than max_decoding_length
    • Fix inconsistent state of asynchronous results in Python when a runtime exception is raised
    • Remove <pad> token when converting MarianMT models from Transformers: this token is only used to start the decoder from a zero embedding, but it is not included in the original Marian model
    • Optimize CPU kernels with vectorized reduction of accumulated values
    • Do not modify the configuration passed to OpenNMTTFConverterV2.from_config
    • Improve Python classes documentation by listing members at the top
  • v2.19.1(Jun 23, 2022)

    Fixes and improvements

    • Fix missing final bias in some MarianMT models converted from Transformers
    • Fix missing final layer normalization in OPT models converted from Transformers
    • Fix error when converting OpenNMT-tf V1 checkpoints with the new OpenNMT-tf converter
    • Reduce model conversion memory usage when the loaded weights are in FP16 and the model is converted with quantization
    • Add missing C++ type ctranslate2::float16_t in the public headers that is required to use some functions
    • Fix some Python typing annotations
  • v2.19.0(Jun 8, 2022)

    New features

    • Support conversion of decoder-only Transformer models trained with OpenNMT-tf

    Fixes and improvements

    • Fix conversion error for Transformers' model facebook/bart-large-cnn
    • Fix crash when scoring empty sequences
    • Apply max_input_length after all special tokens have been added to the input
    • Clear the GPU memory cache when no new batches are immediately available for execution
    • Improve functions signature in the generated Python API documentation
    • Update oneDNN to 2.6
    • Update spdlog to 1.10.0
    • Update OpenBLAS to 0.3.20
  • v2.18.0(May 23, 2022)

    New features

    • Support Meta's OPT models via the Transformers converter
    • Extend the Fairseq converter to support transformer_lm models

    Fixes and improvements

    • Fix conversion error for Marian's pre-norm Transformer models
    • Fix conversion error for Transformers' MarianMT models that are missing some configuration fields
    • Improve conversion speed of Marian models (optimize the generation of the sinusoidal position encodings)
  • v2.17.0(May 9, 2022)

    New features

    • Add a converter for Hugging Face's Transformers. The following models are currently supported:
      • BART
      • M2M100
      • MarianMT
      • MBART
      • OpenAI GPT2
    • Revisit the OpenNMT-tf converter to better support custom models and configurations:
      • Extend the conversion script to accept the training configuration
      • Add a new converter class ctranslate2.converters.OpenNMTTFConverterV2
    • Move all documentation and guides to the website to improve navigation and clarity

    Fixes and improvements

    • In text generation, include the start token in the output if it is not the BOS token
  • v2.16.0(Apr 28, 2022)

    New features

    • Initial support of language models:
      • Add a high-level class ctranslate2.Generator to generate text with language models
      • Add a converter for OpenAI GPT-2 models
      • Update the OpenNMT-py converter to support transformer_lm decoders
    • Build ARM64 wheels for macOS
    • Allow loading custom Fairseq extensions and architectures during conversion with the option --user_dir
    • Enable conversion of the Fairseq architectures multilingual_transformer and multilingual_transformer_iwslt_de_en
    • Implement random sampling in beam search using the Gumbel-max trick
    • Generate and publish the Python API reference to https://opennmt.net/CTranslate2

    Fixes and improvements

    • Fix model loading on a GPU with index > 0
    • Fix memory error when running random sampling on GPU with certain batch sizes
    • Fix incorrect tokens order in some converted Marian vocabularies
    • Properly count the number of layers before building the encoder/decoder instead of relying on runtime exceptions
  • v2.15.1(Apr 4, 2022)

  • v2.15.0(Apr 4, 2022)

    New features

    • Expose translator option max_queued_batches to configure the maximum number of queued batches (when the queue is full, future requests will block until a free slot is available)
    • Allow converters to customize the vocabulary special tokens <unk>, <s>, and </s>

    Fixes and improvements

    • Fix compatibility of models converted on Windows with other platforms by saving the vocabulary files with the newline character "\n" instead of "\r\n"
    • Clarify conversion error when no TensorFlow checkpoints are found in the configured model directory
    • Enable fused QKV transposition by switching the heads and time dimensions before the QKV split
    • Cache the prepared source lengths mask in the Transformer decoder state and reuse it in the next decoding steps
    • Pad the output layer to enable Tensor Cores only once instead of updating the layer on each batch
    • Vectorize copy in Concat and Split ops on GPU
    • Factorize all OpenMP parallel for loops to call the parallel_for function
    • Compile CUDA kernels for deprecated Compute Capabilities that are not yet dropped by CUDA:
      • CUDA 11: 3.5 and 5.0
      • CUDA 10: 3.0
  • v2.14.0(Mar 16, 2022)

    New features

    • Include BART and MBART in the list of supported Fairseq architectures
    • Add Fairseq converter option --no_default_special_tokens to require all special tokens to be set by the user during inference, including the decoder start tokens (for example, this is required by MBART-25 to properly set the language tokens)

    Fixes and improvements

    • Fix conversion of Post-Norm Transformers trained with OpenNMT-tf
    • Fix scoring with Fairseq models that used an incorrect decoder start token (Fairseq uses </s> as the decoder start token, not <s>)
    • Fix scoring result to include the end of sentence token
    • Ignore OpenNMT-py options --alignment_layer and --alignment_heads for models that are not trained with alignments
    • Enable batch encoding in return_alternatives translation mode (the decoding still runs sequentially)
    • Make enumerations ctranslate2.specs.Activation and ctranslate2.specs.EmbeddingsMerge public since they could be used to configure the Transformer specification
    • Update oneDNN to 2.5.3
    • Update cpu_features to 0.7.0
    • Update cxxopts to 3.0.0
    • Update spdlog to 1.9.2
  • v2.13.1(Mar 2, 2022)

  • v2.13.0(Feb 28, 2022)

    New features

    • Add converter for Marian and support the collection of OPUS-MT pretrained models
    • Support models applying a layer normalization after the embedding layer (cf. option --layernorm-embedding in Fairseq)
    • Support models using the Swish (a.k.a SiLU) activation function
    • Support models using custom decoder start tokens, which can be passed in the target prefix

    Fixes and improvements

    • Remove unexpected call to a CUDA function in CPU execution when unloading models
    • Add option groups in the translation client help output
    • Use new thrust::cuda::par_nosync execution policy when calling Thrust functions
    • Update Thrust to 1.16.0
    • Update pybind11 to 2.9.1
  • v2.12.0(Feb 1, 2022)

    New features

    • Support models using additional source features (a.k.a. factors)

    Fixes and improvements

    • Fix compilation with CUDA < 11.2
    • Fix incorrect revision number reported in the error message for unsupported model revisions
    • Improve quantization correctness by rounding the value instead of truncating (this change will only apply to newly converted models)
    • Improve default value of intra_threads when the system has less than 4 logical cores
    • Update oneDNN to 2.5.2
  • v2.11.0(Jan 11, 2022)

    Changes

    • With CUDA >= 11.2, the environment variable CT2_CUDA_ALLOCATOR now defaults to cuda_malloc_async which should improve performance on GPU.

    New features

    • Build Python wheels for AArch64 Linux

    Fixes and improvements

    • Improve performance of Gather CUDA kernel by using vectorized copy
    • Update Intel oneAPI to 2022.1
    • Update oneDNN to 2.5.1
    • Log some additional information with CT2_VERBOSE >= 1:
      • Location and compute type of loaded models
      • Version of the dynamically loaded cuBLAS library
      • Selected CUDA memory allocator
  • v2.10.1(Dec 15, 2021)

  • v2.10.0(Dec 13, 2021)

    Changes

    • inter_threads now also applies to GPU translation, where each translation thread is using a different CUDA stream to allow some parts of the GPU execution to overlap

    New features

    • Add option disable_unk to disable the generation of unknown tokens
    • Add function set_random_seed to fix the seed in random sampling
    • [C++] Add constructors in Translator and TranslatorPool classes with ModelReader parameter

    Fixes and improvements

    • Fix incorrect output from the Multinomial op when running on GPU with a small batch size
    • Fix Thrust and CUB headers that were included from the CUDA installation instead of the submodule
    • Fix static library compilation with the default build options (cmake -DBUILD_SHARED_LIBS=OFF)
    • Compile the Docker image and the Linux Python wheels with SSE 4.1 (vectorized kernels are still compiled for AVX and AVX2 with automatic dispatch, but other source files are now compiled with SSE 4.1)
    • Enable /fp:fast for MSVC to mirror -ffast-math that is enabled for GCC and Clang
    • Statically link against oneDNN to reduce the size of published binaries:
      • Linux Python wheels: 43MB -> 17MB
      • Windows Python wheels: 41MB -> 11MB
      • Docker image: 733MB -> 600MB
  • v2.9.0(Dec 1, 2021)

    New features

    • Add GPU support to the Windows Python wheels
    • Support OpenNMT-py and Fairseq options --alignment_layer and --alignment_heads which specify how the multi-head attention is reduced and returned by the Transformer decoder
    • Support dynamic loading of CUDA libraries on Windows

    Fixes and improvements

    • Fix division by zero when normalizing the score of an empty target
    • Fix error that was not raised when the input length is greater than the number of position encodings
    • Improve performance of random sampling on GPU for large values of sampling_topk or when sampling over the full vocabulary
    • Include transformer_align and transformer_wmt_en_de_big_align in the list of supported Fairseq architectures
    • Add a CUDA kernel to prepare the length mask to avoid moving back to the CPU
  • v2.8.1(Nov 17, 2021)

    Fixes and improvements

    • Fix dtype error when reading float16 scores in greedy search
    • Fix usage of MSVC linker option /nodefaultlib that was not correctly passed to the linker
  • v2.8.0(Nov 15, 2021)

    Changes

    • The Linux Python wheels now use Intel OpenMP instead of GNU OpenMP for consistency with other published binaries

    New features

    • Build Python wheels for Windows

    Fixes and improvements

    • Fix segmentation fault when calling Translator.unload_model while an asynchronous translation is running
    • Fix implementation of repetition penalty that should be applied to all previously generated tokens and not just the tokens of the last step
    • Fix missing application of repetition penalty in greedy search
    • Fix incorrect token index when using a target prefix and a vocabulary mapping file
    • Set the OpenMP flag when compiling on Windows with -DOPENMP_RUNTIME=INTEL or -DOPENMP_RUNTIME=COMP
  • v2.7.0(Nov 4, 2021)

    Changes

    • Inputs are now truncated after 1024 tokens by default to limit the maximum memory usage (see translation option max_input_length)

    New features

    • Add translation option max_input_length to limit the model input length
    • Add translation option repetition_penalty to apply an exponential penalty on repeated sequences
    • Add scoring option with_tokens_score to also output token-level scores when scoring a file

    Fixes and improvements

    • Adapt the length penalty formula when using normalize_scores to match other implementations: the scores are divided by pow(length, length_penalty)
    • Implement LayerNorm with a single CUDA kernel instead of 2
    • Simplify the beam search implementation
  • v2.6.0(Oct 15, 2021)

    New features

    • Build wheels for Python 3.10
    • Accept passing the vocabulary as a opennmt.data.Vocab object or a list of tokens in the OpenNMT-tf converter

    Fixes and improvements

    • Fix segmentation fault in greedy search when normalize_scores is enabled but not return_scores
    • Fix segmentation fault when min_decoding_length and max_decoding_length are both set to 0
    • Fix segmentation fault when sampling_topk is larger than the vocabulary size
    • Fix incorrect score normalization in greedy search when max_decoding_length is reached
    • Fix incorrect score normalization in the return_alternatives translation mode
    • Improve error checking when reading the binary model file
    • Apply LogSoftMax in-place during decoding and scoring
Owner
OpenNMT: Open source ecosystem for neural machine translation and neural sequence learning