Fast parallel CTC.

Overview

In Chinese: 中文版 (Chinese version)

warp-ctc

A fast parallel implementation of CTC, on both CPU and GPU.

Introduction

Connectionist Temporal Classification is a loss function useful for performing supervised learning on sequence data, without needing an alignment between input data and labels. For example, CTC can be used to train end-to-end systems for speech recognition, which is how we have been using it at Baidu's Silicon Valley AI Lab.

[Figure: a spectrogram input aligned to the output sequence "THE CAT" by CTC]

The illustration above shows CTC computing the probability of an output sequence "THE CAT ", as a sum over all possible alignments of input sequences that could map to "THE CAT ", taking into account that labels may be duplicated because they may stretch over several time steps of the input data (represented by the spectrogram at the bottom of the image). Computing the sum of all such probabilities explicitly would be prohibitively costly due to the combinatorics involved, but CTC uses dynamic programming to dramatically reduce the complexity of the computation. Because CTC is a differentiable function, it can be used during standard SGD training of deep neural networks.
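
The recursion is compact enough to sketch. Below is a minimal, hedged illustration of the forward dynamic program in probability space, for a single sequence; this is for exposition only, not warp-ctc's actual code, which works in log space and in parallel:

```c
/* A minimal sketch of the CTC forward dynamic program in probability
   space (exposition only; warp-ctc works in log space and in parallel).
   probs is a T x A row-major matrix of per-timestep symbol
   probabilities, index 0 is the blank, and label holds L non-blank
   symbol indices. Assumes 2*L + 1 <= MAX_S. */
double ctc_forward(const double* probs, int T, int A,
                   const int* label, int L) {
    enum { BLANK = 0, MAX_S = 128 };
    int S = 2 * L + 1;                  /* labels with blanks: ^a^b^ */
    int ext[MAX_S];
    double prev[MAX_S] = {0}, cur[MAX_S];

    for (int s = 0; s < S; s++)
        ext[s] = (s % 2 == 0) ? BLANK : label[s / 2];

    prev[0] = probs[ext[0]];            /* start on the blank...     */
    if (S > 1) prev[1] = probs[ext[1]]; /* ...or on the first label  */

    for (int t = 1; t < T; t++) {
        for (int s = 0; s < S; s++) {
            double sum = prev[s];                      /* stay       */
            if (s > 0) sum += prev[s - 1];             /* advance    */
            if (s > 1 && ext[s] != BLANK && ext[s] != ext[s - 2])
                sum += prev[s - 2];                    /* skip blank */
            cur[s] = sum * probs[t * A + ext[s]];
        }
        for (int s = 0; s < S; s++) prev[s] = cur[s];
    }
    /* valid alignments end on the last label or the trailing blank */
    return prev[S - 1] + (S > 1 ? prev[S - 2] : 0.0);
}
```

With T = 2 timesteps over an alphabet of blank plus one symbol, uniform probability 0.5 everywhere, and the single-symbol label {1}, the three alignments `aa`, `a-`, `-a` each contribute 0.25, so the function returns 0.75.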

In our lab, we focus on scaling up recurrent neural networks, and CTC loss is an important component. To make our system efficient, we parallelized the CTC algorithm, as described in this paper. This project contains our high performance CPU and CUDA versions of the CTC loss, along with bindings for Torch. The library provides a simple C interface, so that it is easy to integrate into deep learning frameworks.

Beyond the raw speedup of a faster parallel CTC implementation, this implementation improves training scalability. For GPU-focused training pipelines, the ability to keep all data local to GPU memory lets us spend interconnect bandwidth on increased data parallelism.

Performance

Our CTC implementation is efficient compared with many of the other publicly available implementations, and it is written to be as numerically stable as possible. The algorithm is numerically sensitive: we have observed catastrophic underflow even in double precision with the standard calculation. In one case, the quotient of two numbers on the order of 1e-324, which should have been approximately one, instead became infinity when the denominator underflowed to 0. Performing the calculation in log space keeps it numerically stable even in single-precision floating point, at the cost of significantly more expensive operations: instead of one machine instruction, each addition requires evaluating multiple transcendental functions. Because of this, the speed of CTC implementations can only be fairly compared when they perform the calculation the same way.

We compare our performance against Eesen, against a CTC implementation built on Theano, and against Stanford-CTC, a Cython CPU-only implementation. We benchmark the Theano implementation operating on 32-bit floating-point numbers and doing the calculation in log space, in order to match the other implementations we compare against. Stanford-CTC was modified to perform the calculation in log space, as it did not support that natively. It also does not support minibatches larger than 1, so it would require an awkward memory layout to use in a real training pipeline; we assume its cost increases linearly with minibatch size.

We show results on two problem sizes relevant to our English and Mandarin end-to-end models, respectively, where T represents the number of timesteps in the input to CTC, L represents the length of the labels for each example, and A represents the alphabet size.

On the GPU, our performance at a minibatch of 64 examples ranges from 7x faster to 155x faster than Eesen, and 46x to 68x faster than the Theano implementation.

GPU Performance

Benchmarked on a single NVIDIA Titan X GPU.

T=150, L=40, A=28    warp-ctc    Eesen     Theano
N=1                  3.1 ms      .5 ms     67 ms
N=16                 3.2 ms      6 ms      94 ms
N=32                 3.2 ms      12 ms     119 ms
N=64                 3.3 ms      24 ms     153 ms
N=128                3.5 ms      49 ms     231 ms

T=150, L=20, A=5000  warp-ctc    Eesen     Theano
N=1                  7 ms        40 ms     120 ms
N=16                 9 ms        619 ms    385 ms
N=32                 11 ms       1238 ms   665 ms
N=64                 16 ms       2475 ms   1100 ms
N=128                23 ms       4950 ms   2100 ms

CPU Performance

Benchmarked on a dual-socket machine with two Intel E5-2660 v3 processors; warp-ctc used 40 threads to take full advantage of the CPU resources. Eesen does not provide a CPU implementation. We observed that the Theano implementation does not parallelize computation across multiple threads, and Stanford-CTC provides no mechanism for parallelization across threads.

T=150, L=40, A=28    warp-ctc    Stanford-CTC   Theano
N=1                  2.6 ms      13 ms          15 ms
N=16                 3.4 ms      208 ms         180 ms
N=32                 3.9 ms      416 ms         375 ms
N=64                 6.6 ms      832 ms         700 ms
N=128                12.2 ms     1684 ms        1340 ms

T=150, L=20, A=5000  warp-ctc    Stanford-CTC   Theano
N=1                  21 ms       31 ms          850 ms
N=16                 37 ms       496 ms         10800 ms
N=32                 54 ms       992 ms         22000 ms
N=64                 101 ms      1984 ms        42000 ms
N=128                184 ms      3968 ms        86000 ms

Interface

The interface is in include/ctc.h. It supports CPU or GPU execution, and you can specify OpenMP parallelism if running on the CPU, or the CUDA stream if running on the GPU. We took care to ensure that the library does not perform memory allocation internally, in order to avoid synchronizations and overheads caused by memory allocation.
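
As a concrete illustration, here is a hedged sketch of a CPU-side call through this interface. The struct and function names (ctcOptions, get_workspace_size, compute_ctc_loss) are recalled from include/ctc.h and may differ between versions; treat this as the shape of the call sequence, not a drop-in snippet, and consult the header for the authoritative signatures.

```c
/* Hedged sketch of a CPU call through the C interface in include/ctc.h.
   Names recalled from the header; check your version of ctc.h.
   Note that the library never allocates memory itself: the caller
   queries the workspace size and supplies the buffer. */
#include <stdlib.h>
#include "ctc.h"

float ctc_loss_cpu_sketch(const float* activations, float* gradients,
                          const int* flat_labels, const int* label_lengths,
                          const int* input_lengths,
                          int alphabet_size, int minibatch) {
    ctcOptions options = {0};
    options.loc = CTC_CPU;
    options.num_threads = 4;          /* OpenMP parallelism */

    size_t workspace_bytes;
    get_workspace_size(label_lengths, input_lengths,
                       alphabet_size, minibatch, options, &workspace_bytes);
    void* workspace = malloc(workspace_bytes);   /* caller-owned */

    float cost;   /* one cost per minibatch element; minibatch == 1 here */
    compute_ctc_loss(activations, gradients, flat_labels,
                     label_lengths, input_lengths,
                     alphabet_size, minibatch, &cost,
                     workspace, options);
    free(workspace);
    return cost;
}
```

Because the caller supplies the workspace, the same buffer can be reused across calls without repeated allocation, which is the point of the no-internal-allocation design.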

Compilation

warp-ctc has been tested on Ubuntu 14.04 and OSX 10.10. Windows is not supported at this time.

First get the code:

git clone https://github.com/baidu-research/warp-ctc.git
cd warp-ctc

Create a build directory:

mkdir build
cd build

If you have a non-standard CUDA install, export CUDA_BIN_PATH=/path_to_cuda so that CMake detects CUDA. To ensure Torch is detected, make sure th is in $PATH.

Run CMake and build:

cmake ../
make

The C library and torch shared libraries should now be built along with test executables. If CUDA was detected, then test_gpu will be built; test_cpu will always be built.

Tests

To run the tests, make sure the CUDA libraries are in LD_LIBRARY_PATH (DYLD_LIBRARY_PATH for OSX).

The Torch tests must be run from the torch_binding/tests/ directory.

Torch Installation

luarocks make torch_binding/rocks/warp-ctc-scm-1.rockspec

You can also install without cloning the repository using:

luarocks install http://raw.githubusercontent.com/baidu-research/warp-ctc/master/torch_binding/rocks/warp-ctc-scm-1.rockspec

There is a Torch CTC tutorial.

Contributing

We welcome improvements from the community; please feel free to submit pull requests.

Known Issues / Limitations

The CUDA implementation requires a device of at least compute capability 3.0.

The CUDA implementation supports a maximum label length of 639 (timesteps are unlimited).

Comments
  • infinite CTC costs

    Apologies if I misunderstood something, but running the following code seems to return infinite CTC costs, though the gradients are fine.

    th> require 'warp_ctc'
    th> acts = torch.Tensor({{0,-150,0,0,0}}):float()
    th> grads = torch.zeros(acts:size()):float()
    th> labels = {{1}}
    th> sizes = {1}
    th> cpu_ctc(acts, grads, labels, sizes)
    {
      1 : inf
    }
    th> print(grads)
     0.2500  0.0000  0.2500  0.2500  0.2500
    [torch.FloatTensor of size 1x5]
    

    Is this simply something that we have to guard against in our own Softmax code?

    opened by shaobohou 10
  • Tensorflow Test Error

    Hi,

    I've installed the TensorFlow binding for warp_ctc. The installation went without a hitch, but after running the command python setup.py test I get an error; I have pasted the output below.

    I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
    I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
    I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
    I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
    I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
    running test
    running egg_info
    writing warpctc_tensorflow.egg-info/PKG-INFO
    writing top-level names to warpctc_tensorflow.egg-info/top_level.txt
    writing dependency_links to warpctc_tensorflow.egg-info/dependency_links.txt
    reading manifest file 'warpctc_tensorflow.egg-info/SOURCES.txt'
    writing manifest file 'warpctc_tensorflow.egg-info/SOURCES.txt'
    running build_ext
    copying build/lib.linux-x86_64-2.7/warpctc_tensorflow/kernels.so -> warpctc_tensorflow
    running test
    running egg_info
    writing warpctc_tensorflow.egg-info/PKG-INFO
    writing top-level names to warpctc_tensorflow.egg-info/top_level.txt
    writing dependency_links to warpctc_tensorflow.egg-info/dependency_links.txt
    reading manifest file 'warpctc_tensorflow.egg-info/SOURCES.txt'
    writing manifest file 'warpctc_tensorflow.egg-info/SOURCES.txt'
    running build_ext
    copying build/lib.linux-x86_64-2.7/warpctc_tensorflow/kernels.so -> warpctc_tensorflow
    testBasicCPU (test_ctc_loss_op.CTCLossTest) ... I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
    name: GeForce GTX TITAN X
    major: 5 minor: 2 memoryClockRate (GHz) 1.2155
    pciBusID 0000:03:00.0
    Total memory: 11.92GiB
    Free memory: 11.77GiB
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0)
    ok
    testBasicGPU (test_ctc_loss_op.CTCLossTest) ... I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0)
    ERROR
    test_session (test_ctc_loss_op.CTCLossTest)
    Returns a TensorFlow Session for use in executing tests. ... ok
    test_basic_cpu (test_warpctc_op.WarpCTCTest) ... I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0)
    ok
    test_basic_gpu (test_warpctc_op.WarpCTCTest) ... I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0)
    ok
    test_multiple_batches_cpu (test_warpctc_op.WarpCTCTest) ... I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0)
    ok
    test_multiple_batches_gpu (test_warpctc_op.WarpCTCTest) ... I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0)
    ok
    test_session (test_warpctc_op.WarpCTCTest)
    Returns a TensorFlow Session for use in executing tests. ... ok
    
    ======================================================================
    ERROR: testBasicGPU (test_ctc_loss_op.CTCLossTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/sarunac4/RNN/warp-ctc/tensorflow_binding/tests/test_ctc_loss_op.py", line 227, in testBasicGPU
        self._testBasic(use_gpu=True)
      File "/home/sarunac4/RNN/warp-ctc/tensorflow_binding/tests/test_ctc_loss_op.py", line 220, in _testBasic
        self._testCTCLoss(inputs, seq_lens, labels, loss_truth, grad_truth, use_gpu=use_gpu)
      File "/home/sarunac4/RNN/warp-ctc/tensorflow_binding/tests/test_ctc_loss_op.py", line 83, in _testCTCLoss
        (tf_loss, tf_grad) = sess.run([loss, grad])
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 717, in run
        run_metadata_ptr)
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 915, in _run
        feed_dict_string, options, run_metadata)
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _do_run
        target_list, options, run_metadata)
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 985, in _do_call
        raise type(e)(node_def, op, message)
    InvalidArgumentError: Cannot assign a device to node 'CTCLoss': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
             [[Node: CTCLoss = CTCLoss[_kernel="WarpCTC", ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/device:GPU:0"](Const_3, Const, Const_1, CTCLoss/sequence_length)]]
    
    Caused by op u'CTCLoss', defined at:
      File "setup.py", line 126, in <module>
        test_suite = 'setup.discover_test_suite',
      File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
        dist.run_commands()
      File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
        cmd_obj.run()
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/setuptools/command/test.py", line 210, in run
        self.run_tests()
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/setuptools/command/test.py", line 231, in run_tests
        testRunner=self._resolve_as_ep(self.test_runner),
      File "/usr/lib/python2.7/unittest/main.py", line 94, in __init__
        self.parseArgs(argv)
      File "/usr/lib/python2.7/unittest/main.py", line 149, in parseArgs
        self.createTests()
      File "/usr/lib/python2.7/unittest/main.py", line 158, in createTests
        self.module)
      File "/usr/lib/python2.7/unittest/loader.py", line 130, in loadTestsFromNames
        suites = [self.loadTestsFromName(name, module) for name in names]
      File "/usr/lib/python2.7/unittest/loader.py", line 91, in loadTestsFromName
        module = __import__('.'.join(parts_copy))
      File "/home/sarunac4/RNN/warp-ctc/tensorflow_binding/setup.py", line 126, in <module>
        test_suite = 'setup.discover_test_suite',
      File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
        dist.run_commands()
      File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
        cmd_obj.run()
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/setuptools/command/test.py", line 210, in run
        self.run_tests()
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/setuptools/command/test.py", line 231, in run_tests
        testRunner=self._resolve_as_ep(self.test_runner),
      File "/usr/lib/python2.7/unittest/main.py", line 95, in __init__
        self.runTests()
      File "/usr/lib/python2.7/unittest/main.py", line 232, in runTests
        self.result = testRunner.run(self.test)
      File "/usr/lib/python2.7/unittest/runner.py", line 151, in run
        test(result)
      File "/usr/lib/python2.7/unittest/suite.py", line 70, in __call__
        return self.run(*args, **kwds)
      File "/usr/lib/python2.7/unittest/suite.py", line 108, in run
        test(result)
      File "/usr/lib/python2.7/unittest/suite.py", line 70, in __call__
        return self.run(*args, **kwds)
      File "/usr/lib/python2.7/unittest/suite.py", line 108, in run
        test(result)
      File "/usr/lib/python2.7/unittest/suite.py", line 70, in __call__
        return self.run(*args, **kwds)
      File "/usr/lib/python2.7/unittest/suite.py", line 108, in run
        test(result)
      File "/usr/lib/python2.7/unittest/suite.py", line 70, in __call__
        return self.run(*args, **kwds)
      File "/usr/lib/python2.7/unittest/suite.py", line 108, in run
        test(result)
      File "/usr/lib/python2.7/unittest/case.py", line 395, in __call__
        return self.run(*args, **kwds)
      File "/usr/lib/python2.7/unittest/case.py", line 331, in run
        testMethod()
      File "/home/sarunac4/RNN/warp-ctc/tensorflow_binding/tests/test_ctc_loss_op.py", line 227, in testBasicGPU
        self._testBasic(use_gpu=True)
      File "/home/sarunac4/RNN/warp-ctc/tensorflow_binding/tests/test_ctc_loss_op.py", line 220, in _testBasic
        self._testCTCLoss(inputs, seq_lens, labels, loss_truth, grad_truth, use_gpu=use_gpu)
      File "/home/sarunac4/RNN/warp-ctc/tensorflow_binding/tests/test_ctc_loss_op.py", line 76, in _testCTCLoss
        sequence_length=seq_lens)
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/ctc_ops.py", line 144, in ctc_loss
        ctc_merge_repeated=ctc_merge_repeated)
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_ctc_ops.py", line 162, in _ctc_loss
        name=name)
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
        op_def=op_def)
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
        original_op=self._default_original_op, op_def=op_def)
      File "/home/sarunac4/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
        self._traceback = _extract_stack()
    
    InvalidArgumentError (see above for traceback): Cannot assign a device to node 'CTCLoss': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
             [[Node: CTCLoss = CTCLoss[_kernel="WarpCTC", ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/device:GPU:0"](Const_3, Const, Const_1, CTCLoss/sequence_length)]]
    
    
    ----------------------------------------------------------------------
    Ran 8 tests in 0.616s
    
    FAILED (errors=1)
    

    Please let me know how to fix this.

    Regards, Deepak

    opened by razor1179 8
  • ValueError: list.remove(x): x not in list

    When I use python setup.py install to install, the terminal displays the following information.

    setup.py:66: UserWarning: Assuming tensorflow was compiled without C++11 ABI. It is generally true if you are using binary pip package. If you compiled tensorflow from source with gcc >= 5 and didn't set -D_GLIBCXX_USE_CXX11_ABI=0 during compilation, you need to set environment variable TF_CXX11_ABI=1 when compiling this bindings. Also be sure to touch some files in src to trigger recompilation. Also, you need to set (or unsed) this environment variable if getting undefined symbol: _ZN10tensorflow... errors
      warnings.warn("Assuming tensorflow was compiled without C++11 ABI. "
    running install
    running bdist_egg
    running egg_info
    writing warpctc_tensorflow.egg-info/PKG-INFO
    writing dependency_links to warpctc_tensorflow.egg-info/dependency_links.txt
    writing top-level names to warpctc_tensorflow.egg-info/top_level.txt
    writing manifest file 'warpctc_tensorflow.egg-info/SOURCES.txt'
    installing library code to build/bdist.linux-x86_64/egg
    running install_lib
    running build_py
    running build_ext
    Traceback (most recent call last):
      File "setup.py", line 149, in <module>
        test_suite = 'setup.discover_test_suite',
      File "/usr/local/lib/python3.6/dist-packages/setuptools/__init__.py", line 145, in setup
        return distutils.core.setup(**attrs)
      File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/local/lib/python3.6/dist-packages/setuptools/command/install.py", line 67, in run
        self.do_egg_install()
      File "/usr/local/lib/python3.6/dist-packages/setuptools/command/install.py", line 109, in do_egg_install
        self.run_command('bdist_egg')
      File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/local/lib/python3.6/dist-packages/setuptools/command/bdist_egg.py", line 172, in run
        cmd = self.call_command('install_lib', warn_dir=0)
      File "/usr/local/lib/python3.6/dist-packages/setuptools/command/bdist_egg.py", line 158, in call_command
        self.run_command(cmdname)
      File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/local/lib/python3.6/dist-packages/setuptools/command/install_lib.py", line 11, in run
        self.build()
      File "/usr/lib/python3.6/distutils/command/install_lib.py", line 109, in build
        self.run_command('build_ext')
      File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/local/lib/python3.6/dist-packages/setuptools/command/build_ext.py", line 78, in run
        _build_ext.run(self)
      File "/usr/local/lib/python3.6/dist-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
        self.build_extensions()
      File "setup.py", line 122, in build_extensions
        self.compiler.compiler_so.remove('-Wstrict-prototypes')
    ValueError: list.remove(x): x not in list

    opened by izhaojinlong 6
  • please look the error when I run "test_cpu"

    Following the Compilation instructions, I built the "test_cpu" executable. When I try to run it, I get this error:

    $ ./test_cpu
    ./test_cpu: error while loading shared libraries: libwarpctc.so: cannot open shared object file: No such file or directory

    "libwarpctc.so" exists. What is the problem?

    opened by coolshinejonson 5
  • Compile failure due to undeclared gpuStream_t

    The prior PR seems to have missed a declaration. In compiling, I see the following error:

    In file included from /tmp/luarocks_warp-ctc-scm-2-3241/warp-ctc/torch_binding/binding.cpp:18:0:
    /tmp/luarocks_warp-ctc-scm-1-3241/warp-ctc/include/detail/reduce.h:4:85: error: 'gpuStream_t' has not been declared
     ctcStatus_t reduce_negate(const T* input, T* output, int rows, int cols, bool axis, gpuStream_t stream);
                                                                                         ^
    /tmp/luarocks_warp-ctc-scm-1-3241/warp-ctc/include/detail/reduce.h:6:82: error: 'gpuStream_t' has not been declared
     ctcStatus_t reduce_exp(const T* input, T* output, int rows, int cols, bool axis, gpuStream_t stream);
                                                                                      ^
    /tmp/luarocks_warp-ctc-scm-1-3241/warp-ctc/include/detail/reduce.h:8:82: error: 'gpuStream_t' has not been declared
     ctcStatus_t reduce_max(const T* input, T* output, int rows, int cols, bool axis, gpuStream_t stream);
                                                                                      ^
    make[2]: *** [CMakeFiles/warp_ctc.dir/torch_binding/binding.cpp.o] Error 1
    make[1]: *** [CMakeFiles/warp_ctc.dir/all] Error 2
    make[1]: *** Waiting for unfinished jobs....
    make: *** [all] Error 2
    

    Reverting to cd828e5b6c3b953b82af73f7f44cddc393a20efa allows me to build successfully.

    opened by matwhite 4
  • Gradient output from batch using torch binding

    Sorry if this is answered elsewhere or blatantly obvious, but I'm not entirely sure of the formatting of the gradients after carrying out a batch like below:

    th>acts = torch.Tensor({{0,0,0,0,0},{1,2,3,4,5},{-5,-4,-3,-2,-1},
                            {0,0,0,0,0},{6,7,8,9,10},{-10,-9,-8,-7,-6},
                            {0,0,0,0,0},{11,12,13,14,15},{-15,-14,-13,-12,-11}}):cuda()
    th>labels = {{1}, {3,3}, {2,3}}
    th>sizes = {1,3,3}
    th>grads = torch.Tensor(acts:size())
    th>gpu_ctc(acts, grads, labels, sizes)
    
    {
      1 : 1.6094379425049
      2 : 7.355742931366
      3 : 4.938850402832
    }
    

    Should we expect the gradients to come back in the same column-major layout we used for the input activations (i.e. would we need to reverse the batch-interleaving we applied to the activation sequences in order to read the gradients)? Thanks!

    opened by SeanNaren 4
  • test_gpu fails on v100 but passes on k80

    On the V100, test_cpu passes but test_gpu fails.

    The V100 machine has the same gcc, CUDA, and cuDNN configuration as the K80, yet test_gpu passes on the K80.

    Error message:

    Running GPU tests
    terminate called after throwing an instance of 'std::runtime_error'
      what():  Error: compute_ctc_loss in small_test, stat = execution failed
    Aborted (core dumped)
    

    Tried to modify CMakeLists.txt, but it didn't work.

    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=0")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_FORCE_INLINES")
    #set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_60,code=sm_60")
    
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_53,code=sm_53")
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_60,code=sm_60")
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_61,code=sm_61")
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_62,code=sm_62")
    
    

    I have seen similar issues reported for the GTX 1080 and GTX 1060, but with no posted solution. Is this related to the GPU architecture?

    Help me plz...

    cuda: 9.0.176 cudnn: 7.0 gcc & g++: 5.4.0

    opened by huangdi95 3
  • installing without CUDA

    I'm trying to install warp-ctc on a google compute instance that does not have CUDA.

    Below is the output from both the cmake and make step:

    CMakeLists.txt	doc  examples  include	LICENSE  python  README.md  src  tests
    [email protected]:/srv/deepspeech/src/transforms/warp-ctc# mkdir build
    [email protected]:/srv/deepspeech/src/transforms/warp-ctc# cd build/
    [email protected]:/srv/deepspeech/src/transforms/warp-ctc/build# cmake ..
    -- The C compiler identification is GNU 4.9.2
    -- The CXX compiler identification is GNU 4.9.2
    -- Check for working C compiler: /usr/bin/cc
    -- Check for working C compiler: /usr/bin/cc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Check for working CXX compiler: /usr/bin/c++
    -- Check for working CXX compiler: /usr/bin/c++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    CUDA_TOOLKIT_ROOT_DIR not found or specified
    -- Could NOT find CUDA (missing:  CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) (Required is at least version "6.5")
    -- cuda found FALSE
    -- Building shared library with no GPU support
    -- Configuring done
    -- Generating done
    -- Build files have been written to: /srv/deepspeech/src/transforms/warp-ctc/build
    [email protected]:/srv/deepspeech/src/transforms/warp-ctc/build# make
    Scanning dependencies of target warpctc
    [ 50%] Building CXX object CMakeFiles/warpctc.dir/src/ctc_entrypoint.cpp.o
    /srv/deepspeech/src/transforms/warp-ctc/src/ctc_entrypoint.cpp:49:30: error: ‘cudaStream_t’ has not been declared
                                  cudaStream_t stream,
                                  ^
    /srv/deepspeech/src/transforms/warp-ctc/src/ctc_entrypoint.cpp: In function ‘int compute_ctc_gpu(const float*, float*, const int*, const int*, const int*, int, int, float*, int, char*)’:
    /srv/deepspeech/src/transforms/warp-ctc/src/ctc_entrypoint.cpp:50:53: error: conflicting declaration of C function ‘int compute_ctc_gpu(const float*, float*, const int*, const int*, const int*, int, int, float*, int, char*)’
                                  char *ctc_gpu_workspace){
                                                         ^
    In file included from /srv/deepspeech/src/transforms/warp-ctc/src/ctc_entrypoint.cpp:5:0:
    /srv/deepspeech/src/transforms/warp-ctc/include/ctc.h:99:5: note: previous declaration ‘int compute_ctc_gpu(const float*, float*, const int*, const int*, const int*, int, int, float*, CUstream, char*)’
     int compute_ctc_gpu(const float* const activations,
         ^
    CMakeFiles/warpctc.dir/build.make:54: recipe for target 'CMakeFiles/warpctc.dir/src/ctc_entrypoint.cpp.o' failed
    make[2]: *** [CMakeFiles/warpctc.dir/src/ctc_entrypoint.cpp.o] Error 1
    CMakeFiles/Makefile2:95: recipe for target 'CMakeFiles/warpctc.dir/all' failed
    make[1]: *** [CMakeFiles/warpctc.dir/all] Error 2
    Makefile:117: recipe for target 'all' failed
    make: *** [all] Error 2
    

    I'm not familiar enough with C compiling to know where this error is or how to fix it, but I'm assuming that it has something to do with the make file not finding CUDA.

    Any help would be greatly appreciated.

    opened by michaelcapizzi 3
  • Failing GPU tests on CUDA 8

    I'm running on a GTX 1070. This is compiled with CUDA 8.0 release candidate.

    The ./test_gpu script fails with the following error:

    Running GPU tests
    terminate called after throwing an instance of 'std::runtime_error'
      what():  Error: compute_ctc_loss in small_test, stat = execution failed
    Aborted (core dumped)
    

    Attaching a debugger, I see:

    (gdb) run
    Starting program: /warp-ctc/build/test_gpu 
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Running GPU tests
    [New Thread 0x7fffef84b700 (LWP 10325)]
    [New Thread 0x7fffef04a700 (LWP 10326)]
    terminate called after throwing an instance of 'std::runtime_error'
      what():  Error: compute_ctc_loss in small_test, stat = execution failed
    
    Program received signal SIGABRT, Aborted.
    0x00007ffff6d55c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
    56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
    (gdb) bt
    #0  0x00007ffff6d55c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
    #1  0x00007ffff6d59028 in __GI_abort () at abort.c:89
    #2  0x00007ffff7660535 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
    #3  0x00007ffff765e6d6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
    #4  0x00007ffff765e703 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
    #5  0x00007ffff765e922 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
    #6  0x0000000000403e33 in throw_on_error (message=0x4097f8 "Error: compute_ctc_loss in small_test", status=<optimized out>) at /storage/deep_learning/warp-ctc/tests/test.h:11
    #7  small_test () at /storage/deep_learning/warp-ctc/tests/test_gpu.cu:63
    #8  0x000000000040360f in main () at /storage/deep_learning/warp-ctc/tests/test_gpu.cu:333
    
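    For context, the `what():` message in the backtrace is produced by a small status-checking helper at the top of the stack. A minimal sketch of that pattern follows; the enum values and names are simplified stand-ins, not warp-ctc's actual `tests/test.h`:

```cpp
#include <stdexcept>
#include <string>

// Hedged sketch: a helper like throw_on_error converts a non-success status
// returned by compute_ctc_loss into a std::runtime_error whose message has
// the shape seen in the log: "<message>, stat = <status string>".
enum ctcStatus_t { CTC_STATUS_SUCCESS = 0, CTC_STATUS_EXECUTION_FAILED = 3 };

inline const char* status_string(ctcStatus_t status) {
    // Simplified stand-in for the library's status-to-string mapping.
    return status == CTC_STATUS_SUCCESS ? "no error" : "execution failed";
}

inline void throw_on_error(ctcStatus_t status, const char* message) {
    if (status != CTC_STATUS_SUCCESS)
        throw std::runtime_error(std::string(message) + ", stat = " +
                                 status_string(status));
}
```

    The helper only reports the failure; an "execution failed" status here usually means a CUDA kernel launch failed upstream. One common cause on then-new Pascal cards (the GTX 1070 is compute capability 6.1) was a build whose `-gencode` list did not include the GPU's architecture.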
    opened by mattndu 3
  • Errors: test_gpu doesn't work

    Hi all, I get these errors when I try to run 'test_gpu':

    Running GPU tests
    terminate called after throwing an instance of 'std::runtime_error'
      what():  Error: compute_ctc_loss in small_test, stat = execution failed
    Aborted (core dumped)

    Note that I can run 'test_cpu' without any problems (it prints: Running CPU tests, Tests pass).

    I was wondering whether I might be missing some libraries or something else. Any suggestions would be appreciated!

    Thank you so much.

    wontfix 
    opened by oaomly 3
  • Building error

    Have you encountered the following error?

    Scanning dependencies of target warpctc
    Linking CXX shared library libwarpctc.so
    /usr/bin/ld: cannot find -lTHC
    collect2: error: ld returned 1 exit status
    make[2]: *** [libwarpctc.so] Error 1
    make[1]: *** [CMakeFiles/warpctc.dir/all] Error 2
    make: *** [all] Error 2
    

    How can I overcome it?

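    `-lTHC` is Torch7's CUDA backend library, so this link error usually means CMake was configured without a visible Torch install. A hedged sketch of pointing the build at one (the install path below is an assumption; adjust it to your machine):

```shell
# Sketch, not a guaranteed fix: put a Torch7 install tree on CMake's search
# path so the linker can resolve -lTHC (libTHC.so lives under its lib/).
TORCH_INSTALL="$HOME/torch/install"   # assumed default Torch7 location
export CMAKE_PREFIX_PATH="$TORCH_INSTALL${CMAKE_PREFIX_PATH:+:$CMAKE_PREFIX_PATH}"
echo "$CMAKE_PREFIX_PATH"
```

    With the variable exported, re-run `cmake ../ && make` from a clean build directory.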
    opened by Jihadik 3
  • why

    -- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19042.
    -- cuda found TRUE
    -- Building shared library with GPU support
    -- Configuring done
    CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
    Please set them or make sure they are set and tested correctly in the CMake files:
    CUDA_curand_LIBRARY (ADVANCED)
        linked by target "test_gpu" in directory D:/python_project/Multimodal-Transformer-master/warp-ctc

    -- Generating done
    CMake Generate step failed.  Build files cannot be regenerated correctly.

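    A `NOTFOUND` value for `CUDA_curand_LIBRARY` on Windows usually means CMake's CUDA detection did not locate the toolkit's import libraries. One hedged workaround is to pass the library path explicitly; the toolkit path below is an illustrative example, not taken from the report:

```shell
cmake -DCUDA_curand_LIBRARY="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.2/lib/x64/curand.lib" ..
```

    Adjust the CUDA version segment of the path to whatever toolkit is actually installed.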
    opened by pipijiev12 0
  • Failed to run cmake on Pytorch1.8 Cuda10.1

    I just ran the cmake command as: cmake -DCMAKE_PREFIX_PATH=/opt/conda/lib/python3.7/site-packages/torch/share/cmake/Torch ../

    The output is

    -- The C compiler identification is GNU 5.4.0
    -- The CXX compiler identification is GNU 5.4.0
    -- Check for working C compiler: /usr/bin/cc
    -- Check for working C compiler: /usr/bin/cc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Check for working CXX compiler: /usr/bin/c++
    -- Check for working CXX compiler: /usr/bin/c++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Looking for pthread.h
    -- Looking for pthread.h - found
    -- Looking for pthread_create
    -- Looking for pthread_create - not found
    -- Looking for pthread_create in pthreads
    -- Looking for pthread_create in pthreads - not found
    -- Looking for pthread_create in pthread
    -- Looking for pthread_create in pthread - found
    -- Found Threads: TRUE
    -- Found CUDA: /usr/local/cuda (found suitable version "10.2", minimum required is "6.5")
    -- Found CUDA: /usr/local/cuda (found version "10.2")
    -- Caffe2: CUDA detected: 10.2
    -- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
    -- Caffe2: CUDA toolkit directory: /usr/local/cuda
    -- Caffe2: Header version is: 10.2
    -- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
    -- Found cuDNN: v7.6.5 (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
    CMake Warning at /opt/conda/lib/python3.7/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:198 (message):
      Failed to compute shorthash for libnvrtc.so
    Call Stack (most recent call first):
      /opt/conda/lib/python3.7/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
      /opt/conda/lib/python3.7/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
      CMakeLists.txt:14 (FIND_PACKAGE)

    -- Autodetected CUDA architecture(s): 7.0
    -- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70
    -- Found Torch: /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
    -- cuda found TRUE
    -- Torch found /opt/conda/lib/python3.7/site-packages/torch/share/cmake/Torch
    -- Building shared library with GPU support
    -- NVCC_ARCH_FLAGS -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl -std=c++14 -Xcompiler -fPIC --expt-relaxed-constexpr --expt-extended-lambda -O2 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 --std=c++11 -Xcompiler -fopenmp
    -- Build tests
    -- Building Torch Bindings with GPU support
    CMake Error at CMakeLists.txt:209 (INSTALL):
      INSTALL TARGETS given no LIBRARY DESTINATION for shared library target "warpctc".

    CMake Error at CMakeLists.txt:217 (ADD_TORCH_PACKAGE):
      Unknown CMake command "ADD_TORCH_PACKAGE".

    -- Configuring incomplete, errors occurred!
    See also "/home/bml/storage/mnt/v-u8wb7huq2nkd2m6o/package/warp-ctc-master/build/CMakeFiles/CMakeOutput.log".
    See also "/home/bml/storage/mnt/v-u8wb7huq2nkd2m6o/package/warp-ctc-master/build/CMakeFiles/CMakeError.log".

    opened by maxh2010 1
  • On Colab it returns --> nvcc fatal: Unsupported gpu architecture 'compute_30'

    Code:

    !pwd
    %cd /content
    !git clone -b pytorch_bindings https://github.com/SeanNaren/warp-ctc.git
    %cd /content/warp-ctc
    !mkdir build
    %cd /content/warp-ctc/build
    !export CUDA_BIN_PATH=/usr/local/cuda
    !cmake ..
    #!cmake -D CMAKE_C_COMPILER:string=gcc-5 -D CMAKE_CXX_COMPILER=g++-5 ../
    !make

    Error:

    /content/warp-ctc/build /content /content/warp-ctc/build
    [ 11%] Building NVCC (Device) object CMakeFiles/warpctc.dir/src/warpctc_generated_reduce.cu.o
    nvcc fatal : Unsupported gpu architecture 'compute_30'
    CMake Error at warpctc_generated_reduce.cu.o.cmake:219 (message):
      Error generating /content/warp-ctc/build/CMakeFiles/warpctc.dir/src/./warpctc_generated_reduce.cu.o

    CMakeFiles/warpctc.dir/build.make:70: recipe for target 'CMakeFiles/warpctc.dir/src/warpctc_generated_reduce.cu.o' failed
    make[2]: *** [CMakeFiles/warpctc.dir/src/warpctc_generated_reduce.cu.o] Error 1
    CMakeFiles/Makefile2:146: recipe for target 'CMakeFiles/warpctc.dir/all' failed
    make[1]: *** [CMakeFiles/warpctc.dir/all] Error 2
    Makefile:129: recipe for target 'all' failed
    make: *** [all] Error 2

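    Colab's current CUDA toolchain (11.x) no longer accepts `compute_30`, while warp-ctc's CMakeLists.txt hard-codes that gencode entry. A hedged sketch of stripping the stale flag, demonstrated on a stand-in file rather than the real CMakeLists.txt:

```shell
# Stand-in for the hard-coded -gencode list in warp-ctc's CMakeLists.txt.
cat > /tmp/nvcc_flags.txt <<'EOF'
-O2 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_60,code=sm_60
EOF
# Remove the architecture that nvcc on CUDA 11 rejects.
sed -i 's/-gencode arch=compute_30,code=sm_30 //g' /tmp/nvcc_flags.txt
cat /tmp/nvcc_flags.txt
```

    The same `sed` expression applied to the real CMakeLists.txt (followed by a clean rebuild of the build directory) should let configuration proceed without the `compute_30` entry.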
    opened by aa-amory 0
  • Upgrade CMakeLists to use v3.15 as minimum and simplify building steps

    I simplified the build procedure, using a modern CMake version (3.15) as the minimum. I think it can be further optimized, but I cannot test it on every platform.

    opened by alealv 0
Owner
Baidu Research