

WeNet

Chinese version (中文版)


Docs | Tutorial | Papers | Runtime (x86) | Runtime (android)

We share Neural Net together.

The main motivation of WeNet is to close the gap between research and production end-to-end (E2E) speech recognition models, to reduce the effort of productionizing E2E models, and to explore better E2E models for production.

Highlights

  • Production first and production ready: The Python code of WeNet meets the requirements of TorchScript, so a model trained by WeNet can be exported directly via Torch JIT and run with LibTorch for inference. There is no gap between the research model and the production model: neither model conversion nor additional code is required for model inference (see the sketch after this list).
  • Unified solution for streaming and non-streaming ASR: WeNet implements the Unified Two-Pass (U2) framework to achieve an accurate, fast, and unified E2E model, which is favorable for industry adoption.
  • Portable runtime: Several demos are provided to show how to host WeNet-trained models on different platforms, including x86 servers and on-device Android.
  • Lightweight: WeNet is designed specifically for E2E speech recognition, with clean and simple code. It is entirely based on PyTorch and its ecosystem, and has no dependency on Kaldi, which simplifies installation and usage.
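
As a minimal illustration of that export path (a sketch with a hypothetical stand-in module, not WeNet's actual model classes), any TorchScript-compatible nn.Module can be scripted and saved for LibTorch:

import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in; any TorchScript-compatible module exports the same way."""
    def __init__(self, idim: int = 80, odim: int = 256):
        super().__init__()
        self.proj = nn.Linear(idim, odim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

model = TinyEncoder().eval()
scripted = torch.jit.script(model)  # direct Torch JIT export, no conversion step
scripted.save("final.zip")          # load from C++ with torch::jit::load("final.zip")

The saved archive is the same kind of artifact the runtime loads for inference.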

Performance Benchmark

Please see examples/$dataset/s0/README.md for benchmarks on different speech datasets.

Installation

  • Clone the repo
git clone https://github.com/wenet-e2e/wenet.git
# [option 1: CUDA 10.1]
conda create -n wenet python=3.8
conda activate wenet
pip install -r requirements.txt
conda install pytorch==1.6.0 cudatoolkit=10.1 torchaudio=0.6.0 -c pytorch

# [option 2: for machines with newer GPUs such as the RTX 3090 (CUDA 11.1)]
conda create -n wenet python=3.8
conda activate wenet
pip install -r requirements.txt
conda install pytorch torchvision torchaudio=0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
  • Optionally, if you want to use the x86 runtime or a language model (LM), you need to build the runtime as follows; otherwise, you can skip this step.
# runtime build requires cmake 3.14 or above
cd runtime/server/x86
mkdir build && cd build && cmake .. && cmake --build .

Discussion & Communication

Please scan the QR code on the left to follow the official account of WeNet.

In addition to discussion in GitHub Issues, we have created a WeChat group for better discussion and quicker response. Please scan the personal QR code on the right; its owner will invite you to the chat group.

If you cannot access the QR images, please view them on Gitee.

Contributors

Acknowledgments

  1. We borrowed a lot of code from ESPnet for transformer-based modeling.
  2. We borrowed a lot of code from Kaldi for WFST-based decoding for LM integration.
  3. We referred to EESEN for building the TLG-based graph for LM integration.
  4. We referred to OpenTransformer for Python batch inference of E2E models.

Citations

@inproceedings{yao2021wenet,
  title={WeNet: Production oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit},
  author={Yao, Zhuoyuan and Wu, Di and Wang, Xiong and Zhang, Binbin and Yu, Fan and Yang, Chao and Peng, Zhendong and Chen, Xiaoyu and Xie, Lei and Lei, Xin},
  booktitle={Proc. Interspeech},
  year={2021},
  address={Brno, Czech Republic},
  organization={IEEE}
}

@article{zhang2020unified,
  title={Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition},
  author={Zhang, Binbin and Wu, Di and Yao, Zhuoyuan and Wang, Xiong and Yu, Fan and Yang, Chao and Guo, Liyong and Hu, Yaguang and Xie, Lei and Lei, Xin},
  journal={arXiv preprint arXiv:2012.05481},
  year={2020}
}

@article{wu2021u2++,
  title={U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition},
  author={Wu, Di and Zhang, Binbin and Yang, Chao and Peng, Zhendong and Xia, Wenjing and Chen, Xiaoyu and Lei, Xin},
  journal={arXiv preprint arXiv:2106.05642},
  year={2021}
}
Issues
  • BUG for ONNX inference

    BUG for ONNX inference

    When I run inference with u2++_conformer, after just 50 wav files, the following bug is thrown:

    I0515 00:10:40.909694 146005 decoder_main.cc:67] num frames 1118
    I0515 00:10:41.026697 146005 decoder_main.cc:86] Partial result: 在机关
    I0515 00:10:41.056061 146005 decoder_main.cc:86] Partial result: 在机关服务
    I0515 00:10:41.085124 146005 decoder_main.cc:86] Partial result: 在机关围剿
    I0515 00:10:41.110785 146005 decoder_main.cc:86] Partial result: 在机关围剿和
    I0515 00:10:41.136417 146005 decoder_main.cc:86] Partial result: 在机关围剿和
    I0515 00:10:41.176227 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程
    I0515 00:10:41.217715 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的
    I0515 00:10:41.251241 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中
    I0515 00:10:41.282459 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢
    I0515 00:10:41.311969 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢坚定
    I0515 00:10:41.341024 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢坚定是
    I0515 00:10:41.398414 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢坚定是一军的
    I0515 00:10:41.429834 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢坚定是一军的
    I0515 00:10:41.462321 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢坚定是一军的主要将领
    Segmentation fault (core dumped)

    This file should be processed completely; I will dig deeper to locate the bug.

    onnxruntime: 1.10.0 and 1.11.1

    opened by Fred-cell 26
  • LibTorch gpu cmake error

    LibTorch gpu cmake error

    Hello, when I execute "mkdir build && cd build && cmake -DGRPC=ON ..", the following error is reported. Native environment: CentOS 7.9; nvidia: 11.3; cuda version: 11.


    (wenet_gpu) [[email protected] build]$ cmake -DGPU=ON ..
    -- The C compiler identification is GNU 4.8.5
    -- The CXX compiler identification is GNU 4.8.5
    -- Check for working C compiler: /usr/bin/cc
    -- Check for working C compiler: /usr/bin/cc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Check for working CXX compiler: /usr/bin/c++
    -- Check for working CXX compiler: /usr/bin/c++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Populating libtorch
    -- Configuring done
    -- Generating done
    -- Build files have been written to: /home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/fc_base/libtorch-subbuild
    [ 11%] Performing download step (download, verify and extract) for 'libtorch-populate'
    -- verifying file... file='/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/fc_base/libtorch-subbuild/libtorch-populate-prefix/src/libtorch-shared-with-deps-1.10.0%2Bcu113.zip'
    -- File already exists and hash match (skip download): file='/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/fc_base/libtorch-subbuild/libtorch-populate-prefix/src/libtorch-shared-with-deps-1.10.0%2Bcu113.zip' SHA256='0996a6a4ea8bbc1137b4fb0476eeca25b5efd8ed38955218dec1b73929090053'
    -- extracting... src='/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/fc_base/libtorch-subbuild/libtorch-populate-prefix/src/libtorch-shared-with-deps-1.10.0%2Bcu113.zip' dst='/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/fc_base/libtorch-src'
    -- extracting... [tar xfz]
    -- extracting... [analysis]
    -- extracting... [rename]
    -- extracting... [clean up]
    -- extracting... done
    [ 22%] No patch step for 'libtorch-populate'
    [ 33%] No update step for 'libtorch-populate'
    [ 44%] No configure step for 'libtorch-populate'
    [ 55%] No build step for 'libtorch-populate'
    [ 66%] No install step for 'libtorch-populate'
    [ 77%] No test step for 'libtorch-populate'
    [ 88%] Completed 'libtorch-populate'
    [100%] Built target libtorch-populate
    -- Looking for pthread.h
    -- Looking for pthread.h - found
    -- Looking for pthread_create
    -- Looking for pthread_create - not found
    -- Looking for pthread_create in pthreads
    -- Looking for pthread_create in pthreads - not found
    -- Looking for pthread_create in pthread
    -- Looking for pthread_create in pthread - found
    -- Found Threads: TRUE
    -- Found CUDA: /usr/local/cuda-11.3 (found version "11.3")
    -- Caffe2: CUDA detected: 11.3
    -- Caffe2: CUDA nvcc is: /usr/local/cuda-11.3/bin/nvcc
    -- Caffe2: CUDA toolkit directory: /usr/local/cuda-11.3
    CMake Error at fc_base/libtorch-src/share/cmake/Caffe2/public/cuda.cmake:75 (message):
      Caffe2: Couldn't determine version from header:
    Change Dir: /home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/CMakeFiles/CMakeTmp

    Run Build Command(s):/usr/bin/gmake cmTC_3d968/fast

    /usr/bin/gmake -f CMakeFiles/cmTC_3d968.dir/build.make CMakeFiles/cmTC_3d968.dir/build

    gmake[1]: Entering directory '/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/CMakeFiles/CMakeTmp'

    Building CXX object CMakeFiles/cmTC_3d968.dir/detect_cuda_version.cc.o

    /usr/bin/c++ -I/usr/local/cuda-11.3/include -std=c++14 -pthread -fPIC -o CMakeFiles/cmTC_3d968.dir/detect_cuda_version.cc.o -c /home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/detect_cuda_version.cc

    c++: error: unrecognized command line option '-std=c++14'

    gmake[1]: *** [CMakeFiles/cmTC_3d968.dir/detect_cuda_version.cc.o] Error 1

    gmake[1]: Leaving directory '/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/CMakeFiles/CMakeTmp'

    gmake: *** [cmTC_3d968/fast] Error 2

    Call Stack (most recent call first):
      fc_base/libtorch-src/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
      fc_base/libtorch-src/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
      cmake/libtorch.cmake:52 (find_package)
      CMakeLists.txt:35 (include)

    -- Configuring incomplete, errors occurred!
    See also "/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/CMakeFiles/CMakeOutput.log".
    See also "/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/CMakeFiles/CMakeError.log".


    Please, what should I do?

    opened by zhaoyinjiang9825 16
  • onnx runtime error 2: not enough space: expected 318080, got 314240

    onnx runtime error 2: not enough space: expected 318080, got 314240

    Describe the bug: This may be a Triton server problem. I deployed the GPU production service (Triton server) provided in the code. When testing the encoder module directly, I send fbank features straight to the server. If three threads send concurrent requests, each with a random number of steps (i.e., fbank sequences of different lengths), transcription is rather slow, but no error occurs. My guess is that because each request has a different number of steps, they cannot be grouped into one batch, so the server-side dynamic_batching spends a long time waiting to assemble a batch. I therefore set max_queue_delay_microseconds to 70000, i.e., after 70 ms stop waiting for a batch and predict directly. Now the client occasionally raises the following exception:

    Traceback (most recent call last):
      File "debug_encoder.py", line 30, in input_numpy
        response = triton_client.infer("encoder",
      File "/opt/conda/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 1156, in infer
        raise_error_grpc(rpc_error)
      File "/opt/conda/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 62, in raise_error_grpc
        raise get_error_grpc(rpc_error) from None
    tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] onnx runtime error 2: not enough space: expected 318080, got 314240

    The steps of the three fbank features I requested are 482, 497, and 485; dims is 80 and batch_size is 1. 318080 is exactly 497*80*8, i.e., the model inexplicably runs out of space while predicting the 497-step request. Across repeated concurrent requests the error is sporadic, and after it appears subsequent requests may still succeed. If requests are sent one by one instead of concurrently, there is no error; if the concurrent requests have a fixed size, there is no error either. The error only occurs when concurrent requests have variable lengths and max_queue_delay_microseconds is small.

    Desktop (please complete the following information):

    • triton server: 21.11
    • Server: 16 GB RAM, 16 GB GPU memory, T4 GPU; it should not be possible that GPU memory or RAM is insufficient
    opened by piekey1994 15
  • Runtime: words containing non-ASCII characters are concatenated without space

    Runtime: words containing non-ASCII characters are concatenated without space

    The runtime outputs decoded words containing non-ASCII characters concatenated with neighbouring words: e.g. "aa ää xx yy" is transformed to "aaääxx yy".

    This is caused by the code block starting at https://github.com/wenet-e2e/wenet/blob/604231391c81efdf06454dbc99406bbc06cb030d/runtime/core/decoder/torch_asr_decoder.cc#L217

    I understand that this is done in order to output Chinese "words" correctly (i.e., without spaces). However, this should at least be configurable, as it currently breaks the wenet runtime for most other languages (i.e. those that have words with non-ASCII characters and whose orthography separates words with spaces); see the sketch below.
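
    For illustration only, here is a hedged Python sketch of the configurable joining this issue asks for; the real runtime logic is C++, and the CJK test below is a simplifying assumption (only the basic CJK Unified Ideographs block):

    # Join decoded pieces with spaces unless both neighbours are CJK characters,
    # so Chinese output stays unspaced while other languages keep word boundaries.
    def is_cjk(ch: str) -> bool:
        return '\u4e00' <= ch <= '\u9fff'  # assumption: basic CJK block only

    def join_pieces(pieces):
        out = ""
        for piece in pieces:
            if out and not (is_cjk(out[-1]) and is_cjk(piece[0])):
                out += " "
            out += piece
        return out

    print(join_pieces(["aa", "ää", "xx", "yy"]))  # -> "aa ää xx yy"
    print(join_pieces(["在", "机关"]))              # -> "在机关"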

    opened by alumae 14
  • cmake compile server/x86 error

    cmake compile server/x86 error


    environment: centos7
    gcc version 7.5.0
    cmake version: 3.18.3
    CUDA version: 10.2
    gpu version:  Quadro RTX 8000
    
    
    install steps:
    $ conda create -n wenet python=3.8
    $ conda activate wenet
    $ pip install -r requirements.txt
    $ conda install pytorch==1.6.0 cudatoolkit=10.2 torchaudio -c pytorch
    
    $ cd wenet/runtime/server/x86/
    $ mkdir build && cd build && cmake .. && cmake --build .
    

    ERROR is as follows:

    [ 50%] Linking CXX executable ctc_prefix_beam_search_test
    /home4/md510/cmake-3.18.3/bin/cmake -E cmake_link_script CMakeFiles/ctc_prefix_beam_search_test.dir/link.txt --verbose=1
    /home3/md510/gcc-7.5.0/bin/g++  -std=c++14 -pthread -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -DC10_USE_GLOG -L/cm/shared/apps/cuda10.2/toolkit/10.2.89/lib64 CMakeFiles/ctc_prefix_beam_search_test.dir/decoder/ctc_prefix_beam_search_test.cc.o -o ctc_prefix_beam_search_test   -L/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/build/openfst/lib  -Wl,-rpath,/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/build/openfst/lib:/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/fc_base/libtorch-src/lib lib/libgtest_main.a lib/libgmock.a libdecoder.a lib/libgtest.a ../fc_base/libtorch-src/lib/libtorch.so -Wl,--no-as-needed,/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/fc_base/libtorch-src/lib/libtorch_cpu.so -Wl,--as-needed ../fc_base/libtorch-src/lib/libc10.so -lpthread -Wl,--no-as-needed,/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/fc_base/libtorch-src/lib/libtorch.so -Wl,--as-needed ../fc_base/libtorch-src/lib/libc10.so kaldi/libkaldi-decoder.a kaldi/libkaldi-lat.a kaldi/libkaldi-util.a kaldi/libkaldi-base.a libutils.a -lfst 
    /home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/fc_base/libtorch-src/lib/libtorch_cpu.so: undefined reference to `[email protected]_2.23'
    /home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/fc_base/libtorch-src/lib/libtorch_cpu.so: undefined reference to `[email protected]_2.23'
    collect2: error: ld returned 1 exit status
    gmake[2]: *** [ctc_prefix_beam_search_test] Error 1
    gmake[2]: Leaving directory `/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/build'
    gmake[1]: *** [CMakeFiles/ctc_prefix_beam_search_test.dir/all] Error 2
    gmake[1]: Leaving directory `/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/build'
    gmake: *** [all] Error 2
    
    

    Could you help me solve it?

    opened by shanguanma 14
  • DLL load failed while importing _wenet: The specified module could not be found.

    DLL load failed while importing _wenet: The specified module could not be found.

    I installed wenet with pip install wenet, and the installation reported success. I then used the example program for recognition:

    import sys
    import wenet

    def get_text_from_wav(dir, wav):
        model_dir = dir
        wav_file = wav
        decoder = wenet.Decoder(model_dir)
        ans = decoder.decode_wav(wav_file)
        print(ans)

    if __name__ == '__main__':
        dir = "./models"
        wav = "./1.wav"
        get_text_from_wav(dir, wav)

    But running it fails with the following error:

    Traceback (most recent call last):
      File "D:\codes\speech2word\main.py", line 2, in <module>
        import wenet
      File "D:\codes\speech2word\venv\lib\site-packages\wenet\__init__.py", line 1, in <module>
        from .decoder import Decoder  # noqa
      File "D:\codes\speech2word\venv\lib\site-packages\wenet\decoder.py", line 17, in <module>
        import _wenet
    ImportError: DLL load failed while importing _wenet: The specified module could not be found.

    How can I solve this?

    opened by billqu01 13
  • [Draft] Cache control v2

    [Draft] Cache control v2

    This is not a merge-ready PR; I'm just pushing my testing code for discussion and further evaluation (such as GPU perf, ONNX export, ...).

    Performance on CPU (Intel i7-10510U @ 1.80GHz): RTF improves from 0.1 to 0.07, about a 30% improvement.

    Detailed descriptions (in Chinese): https://horizonrobotics.feishu.cn/sheets/shtcniLh77AgP6NJAXhd5UHXDwh

    Test code:

    bash rtf.sh --api 1 > log.txt.1
    bash rtf.sh --api 2 > log.txt.2
    grep "RTF:" log.txt.1
    grep "RTF:" log.txt.2
    

    u2++_conformer.zip: https://horizonrobotics.feishu.cn/file/boxcnO50Ea8m0rR2p9FwJ8ZHEIc
    words.txt: https://horizonrobotics.feishu.cn/file/boxcnBpSEOWoBSIgLdlHetsjOFd

    opened by xingchensong 13
  • DDP training gets stuck

    DDP training gets stuck

    Describe the bug

    I got stuck when using DDP training with my own wenet code and my own data. It gets stuck (GPU utilization 100%) at the beginning of the second epoch every time. After debugging, it was found to be stuck at this position:

    # wenet/utils/executor.py
    with torch.cuda.amp.autocast(scaler is not None):
        loss, loss_att, loss_ctc = model(
            feats, feats_lengths, target, target_lengths)
    

    Environment

    CentOS Linux release 7.8.2003 (Core)
    GPU Driver Version: 450.80.02
    CUDA Version: 10.2
    torch==1.8.0, torchaudio==1.8.1, torchvision==0.9.0

    Some Attempts

    I made some attempts later and found:

    • 1 GPU: no problem; multi-GPU: stuck
    • static batch: no problem; dynamic batch: stuck
    • conformer: no problem; unified_conformer: stuck

    Other attempts:

    • Upgrading the PyTorch version to 1.9.0 or 1.10.0: useless
    • Setting num_workers=0/1: useless
    • V100 -> P40: useless
    • Sleeping 1 minute after completing an epoch: useless
    • NCCL gets completely stuck without any error log; the GLOO error log is:

    2021-12-07 11:36:17,011 INFO Epoch 0 CV info cv_loss 115.3632936241356
    2021-12-07 11:36:17,011 INFO Epoch 1 TRAIN info lr 6.08e-06
    2021-12-07 11:36:17,014 INFO using accumulate grad, new batch size is 8 times larger than before
    2021-12-07 11:36:17,335 INFO Epoch 0 CV info cv_loss 115.36239801458647
    2021-12-07 11:36:17,335 INFO Epoch 1 TRAIN info lr 6.200000000000001e-06
    2021-12-07 11:36:17,338 INFO using accumulate grad, new batch size is 8 times larger than before
    2021-12-07 11:36:17,579 INFO Epoch 0 CV info cv_loss 115.36309641650827
    2021-12-07 11:36:17,579 INFO Epoch 1 TRAIN info lr 5.96e-06
    2021-12-07 11:36:17,582 INFO using accumulate grad, new batch size is 8 times larger than before
    2021-12-07 11:36:17,926 INFO Epoch 0 CV info cv_loss 115.36275817930736
    2021-12-07 11:36:17,926 INFO Checkpoint: save to checkpoint exp/conformer/0.pt
    2021-12-07 11:36:18,889 INFO Epoch 1 TRAIN info lr 6.32e-06
    2021-12-07 11:36:18,892 INFO using accumulate grad, new batch size is 8 times larger than before
    terminate called after throwing an instance of 'gloo::EnforceNotMet'
      what():  [enforce fail at /opt/conda/conda-bld/pytorch_1614378062065/work/third_party/gloo/gloo/transport/tcp/pair.cc:490] op.preamble.length <= op.nbytes. 939336 vs 4
    ./run.sh: line 165:  7108 Aborted                 (core dumped) python wenet/bin/train.py --gpu $gpu_id --config $train_config --data_type $data_type --symbol_table $dict --train_data data/$train_set/data.list --cv_data data/dev/data.list ${checkpoint:+--checkpoint $checkpoint} --model_dir $dir --ddp.init_method $init_method --ddp.world_size $world_size --ddp.rank $rank --ddp.dist_backend $dist_backend --num_workers 8 $cmvn_opts --pin_memory
    /homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000002.tar'>
      self._target(*self._args, **self._kwargs)
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    /homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000110.tar'>
      self._target(*self._args, **self._kwargs)
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    /homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000112.tar'>
      self._target(*self._args, **self._kwargs)
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    /homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000075.tar'>
      self._target(*self._args, **self._kwargs)
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    /homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000001.tar'>
      self._target(*self._args, **self._kwargs)
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    /homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000086.tar'>
      self._target(*self._args, **self._kwargs)
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    Traceback (most recent call last):
      File "wenet/bin/train.py", line 277, in <module>
        main()
      File "wenet/bin/train.py", line 250, in main
        executor.train(model, optimizer, scheduler, train_data_loader, device,
      File "/homepath/tools/wenet-uio/wenet/utils/executor.py", line 71, in train
        loss.backward()
      File "/homepath/envs/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/homepath/envs/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
        Variable._execution_engine.run_backward(
    RuntimeError: [/opt/conda/conda-bld/pytorch_1614378062065/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [11.88.165.7]:54008
    Traceback (most recent call last):
      File "wenet/bin/train.py", line 277, in <module>
        main()
      File "wenet/bin/train.py", line 250, in main
        executor.train(model, optimizer, scheduler, train_data_loader, device,
      File "/homepath/tools/wenet-uio/wenet/utils/executor.py", line 71, in train
        loss.backward()
      File "/homepath/envs/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/homepath/envs/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
        Variable._execution_engine.run_backward(
    RuntimeError: Application timeout caused pair closure
    

    To Reproduce

    Finally, I pulled the latest wenet code and reproduced the above problem with the aishell recipe:

    data_type=shard
    train_config=conf/train_unified_conformer.yaml
    cmvn=false
    dynamic batch
    accum_grad=8
    

    How should this be solved? Thank you.

    opened by 601222543 12
  • Decoding hangs when using LM rescoring

    Decoding hangs when using LM rescoring

    I'm following this tutorial to use LM rescoring for decoding: https://github.com/wenet-e2e/wenet/blob/23a61b212bf2c3886546925913f5574f779f474a/examples/librispeech/s0/run.sh#L234

    I didn't re-train a model; instead, I use the pre-trained conformer model. I had no problem building TLG.fst, but ./tools/decode.sh hangs forever when evaluating on the test set. Could you provide any suggestions on where the problem might be and how to debug it?

    Below is the code I used for LM rescoring (taken out of run.sh):

    pretrained_model=wenet/models/20210216_conformer_exp
    dict=$pretrained_model/words.txt
    bpemodel=$pretrained_model/train_960_unigram5000
    
    lm=data/local/lm
    lexicon=data/local/dict/lexicon.txt
    mkdir -p $lm
    mkdir -p data/local/dict
    
    # 7.1 Download & format LM
    which_lm=3-gram.pruned.1e-7.arpa.gz
    if [ ! -e ${lm}/${which_lm} ]; then
        wget http://www.openslr.org/resources/11/${which_lm} -P ${lm}
    fi
    echo "unzip lm($which_lm)..."
    gunzip -k ${lm}/${which_lm} -c > ${lm}/lm.arpa
    echo "Lm saved as ${lm}/lm.arpa"
    
    # 7.2 Prepare dict
    unit_file=$dict
    bpemodel=$bpemodel
    # use $dir/words.txt (unit_file) and $dir/train_960_unigram5000 (bpemodel)
    # if you download pretrained librispeech conformer model
    cp $unit_file data/local/dict/units.txt
    if [ ! -e ${lm}/librispeech-lexicon.txt ]; then
        wget http://www.openslr.org/resources/11/librispeech-lexicon.txt -P ${lm}
    fi
    echo "build lexicon..."
    tools/fst/prepare_dict.py $unit_file ${lm}/librispeech-lexicon.txt \
        $lexicon $bpemodel.model
    echo "lexicon saved as '$lexicon'"
    
    # 7.3 Build decoding TLG
    tools/fst/compile_lexicon_token_fst.sh \
       data/local/dict data/local/tmp data/local/lang
    tools/fst/make_tlg.sh data/local/lm data/local/lang data/lang_test || exit 1;
    
    # 7.4 Decoding with runtime
    echo "Start decoding..."
    fst_dir=data/lang_test
    dir=$pretrained_model
    recog_set="test_clean"
    for test in ${recog_set}; do
        ./tools/decode.sh --nj 2 \
            --beam 10.0 --lattice_beam 5 --max_active 7000 --blank_skip_thresh 0.98 \
            --ctc_weight 0.5 --rescoring_weight 1.0 --acoustic_scale 1.2 \
            --fst_path $fst_dir/TLG.fst \
            data/$test/wav.scp.10 data/$test/text.10 $dir/final.zip $fst_dir/words.txt \
            $dir/lm_with_runtime_${test}
        tail $dir/lm_with_runtime_${test}/wer
    done
    
    opened by boliangz 12
  • macOS M1 support?

    macOS M1 support?

    [ 96%] Linking CXX shared library libwenet_api.dylib
    ld: warning: ignoring file ../../../fc_base/libtorch-src/lib/libtorch.dylib, building for macOS-arm64 but attempting to link with file built for macOS-x86_64

    opened by jinfagang 11
  • [bin] onnx_streaming and onnx_non_streaming

    [bin] onnx_streaming and onnx_non_streaming

    I find it quite easy to export ONNX (either a streaming or a non-streaming model) with our recently developed new cache API. The trickiest part of ONNX export is ensuring that all if-else branches in forward_chunk always take the same path under a given decoding setup (i.e., 16/-1, 16/4, 16/0 or -1/-1); this turns each dynamic if-else branch into a static one, and next_cache_start becomes a fixed value. To do this, we can simply adjust the shapes of the cache & mask tensors for the first input chunk according to chunk_size and num_left_chunks, as the sketch below illustrates.
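
    To make the idea concrete, here is a toy sketch (not WeNet's actual export code): a stand-in module whose cache handling depends only on tensor shapes, exported with a cache pre-shaped for the 16/4 setup so tracing sees a single static path:

    import torch
    import torch.nn as nn

    class ChunkEncoder(nn.Module):
        """Toy stand-in for forward_chunk; its behaviour depends only on cache shape."""
        def __init__(self, dim: int = 80):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, chunk: torch.Tensor, cache: torch.Tensor) -> torch.Tensor:
            xs = torch.cat([cache, chunk], dim=1)  # prepend cached history frames
            return self.proj(xs)

    chunk_size, num_left_chunks, dim = 16, 4, 80
    model = ChunkEncoder(dim).eval()
    chunk = torch.zeros(1, chunk_size, dim)
    cache = torch.zeros(1, num_left_chunks * chunk_size, dim)  # shaped for the 16/4 setup

    torch.onnx.export(model, (chunk, cache), "encoder_chunk.onnx",
                      input_names=["chunk", "cache"], output_names=["out"],
                      opset_version=13)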

    TODO

    • [x] check consistency between onnx_model and pytorch_model on random data
    • [x] check CER on aishell-1 testset

    Q&A

    1. Why don't we need jit.script for if-else branches? As mentioned above, we force the model to take the same branch for every chunk.

    2. Why don't we need jit.script for tensor slicing? From @Mddct: ONNX supports dynamic slice for opset >= 13, see https://github.com/onnx/onnx/blob/main/docs/Operators.md#Slice

    # Author: @Mddct 
    import torch
    from torch import nn
    
    class BadFirst(nn.Module):
        def __init__(self):
            super().__init__()
    
        def forward(self, x: torch.Tensor, index: torch.Tensor):
    
            x_firsts = x[:index]
            return x_firsts
    
    
    m = BadFirst().eval()
    x = torch.ones(44)
    index = torch.tensor(0, dtype=torch.int)
    
    res = m(x, index=index) # this works
    torch.onnx.export(m, (x, index),  "m.onnx",  opset_version=14)
    
    opened by xingchensong 11
  • torch batch forward

    torch batch forward

    Describe the bug: For non-streaming inference, the exported ONNX model seems to support batch forward. I tried to modify asr_model.py to export a TorchScript model that also supports batch forward (not chunk forward), just like the ONNX one: the encoder runs as expected, but it fails in the decoder embedding. After tracing, I suspect the ignore_id of -1 used for padding causes the bug. I have compared the code and see no differences, which confuses me: why can ONNX export a batch-forward model while TorchScript can't? Since you don't export these interfaces, maybe you know more; can you help me out? Thank you.

    To Reproduce: Steps to reproduce the behavior:

    1. Add jit export functions in asr_model.py
    2. run batch input data in x86 runtime
    @torch.jit.export
    def batch_forward_encoder(
        self,
        xs: torch.Tensor,
        xs_lens: torch.Tensor,
        beam_size: int = 10
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        encoder_out, encoder_mask = self.encoder.forward(xs, xs_lens, -1, -1)
        encoder_out_lens = encoder_mask.squeeze(1).sum(1)
        ctc_log_probs = self.ctc.log_softmax(encoder_out)
        encoder_out_lens = encoder_out_lens.int()
        beam_log_probs, beam_log_probs_idx = torch.topk(ctc_log_probs, beam_size, dim=2)
        return encoder_out, encoder_out_lens, ctc_log_probs, beam_log_probs, beam_log_probs_idx
    
    @torch.jit.export
    def batch_forward_attention_decoder(
        self,
        encoder_out: torch.Tensor,
        encoder_lens: torch.Tensor,
        hyps_pad_sos_eos: torch.Tensor,
        hyps_lens_sos: torch.Tensor,
        r_hyps_pad_sos_eos: torch.Tensor,
        ctc_score: torch.Tensor,
        ctc_weight: float = 0.5,
        reverse_weight: float = 0.0,
        beam_size: int = 10
    ) -> torch.Tensor:
        B, T, F = encoder_out.shape
        bz = beam_size
        B2 = B * bz
        encoder_out = encoder_out.repeat(1, bz, 1).view(B2, T, F)
        encoder_mask = ~make_pad_mask(encoder_lens, T).unsqueeze(1)
        encoder_mask = encoder_mask.repeat(1, bz, 1).view(B2, 1, T)
        T2 = hyps_pad_sos_eos.shape[2] - 1
        hyps_pad = hyps_pad_sos_eos.view(B2, T2 + 1)
        hyps_lens = hyps_lens_sos.view(B2,)
        hyps_pad_sos = hyps_pad[:, :-1].contiguous()
        hyps_pad_eos = hyps_pad[:, 1:].contiguous()
    
        r_hyps_pad = r_hyps_pad_sos_eos.view(B2, T2 + 1)
        r_hyps_pad_sos = r_hyps_pad[:, :-1].contiguous()
        r_hyps_pad_eos = r_hyps_pad[:, 1:].contiguous()
    
        decoder_out, r_decoder_out, _ = self.decoder(
            encoder_out, encoder_mask, hyps_pad_sos, hyps_lens, r_hyps_pad_sos,
            reverse_weight)
        decoder_out = torch.nn.functional.log_softmax(decoder_out, dim=-1)
        V = decoder_out.shape[-1]
        decoder_out = decoder_out.view(B2, T2, V)
        mask = ~make_pad_mask(hyps_lens, T2)  # B2 x T2
        # mask index, remove ignore id
        index = torch.unsqueeze(hyps_pad_eos * mask, 2)
        score = decoder_out.gather(2, index).squeeze(2)  # B2 X T2
        # mask padded part
        score = score * mask
        decoder_out = decoder_out.view(B, bz, T2, V)
        if reverse_weight > 0:
            r_decoder_out = torch.nn.functional.log_softmax(r_decoder_out, dim=-1)
            r_decoder_out = r_decoder_out.view(B2, T2, V)
            index = torch.unsqueeze(r_hyps_pad_eos * mask, 2)
            r_score = r_decoder_out.gather(2, index).squeeze(2)
            r_score = r_score * mask
            score = score * (1 - reverse_weight) + reverse_weight * r_score
            r_decoder_out = r_decoder_out.view(B, bz, T2, V)
        score = torch.sum(score, dim=1)  # B2
        score = torch.reshape(score, (B, bz)) + ctc_weight * ctc_score
        best_index = torch.argmax(score, dim=1)
        return best_index
    

    Expected behavior: runs successfully.

    opened by murphypei 4
  • python performance issues

    python performance issues

    The Python interface provided by wenet can only be executed in a single thread, right? The original C++ code needs to load and read the model every time a Decoder object is created, and it does not support creating multiple Decoders in the same thread. If I want to support multi-threading, how should I modify it?

    opened by tuocheng0824 3
  • How to add language model to X86_GPU?

    How to add language model to X86_GPU?

    I see "Add language model: set --lm_path in the convert_start_server.sh. Notice the path of your language model is the path in docker." in the x86_GPU guide doc。 please how can i do it!

    opened by zhaoyinjiang9825 6
  • Streaming performance issues on upgrading to release v2.0.0

    Streaming performance issues on upgrading to release v2.0.0

    Describe the bug: On updating to release v2.0.0, I've noticed performance issues when running real-time audio streams against a quantized E2E model (no LM) via runtime/server/x86/bin/websocket_server_main. For some stretches of time performance is comparable between v1 and v2, but at some points I see upwards of a 20 s delay on a given response. Outside of a few minor updates related to the switch, nothing else (e.g. resource allocations) has changed on my end.

    Thus far, I haven't been able to pinpoint much of a pattern to the lag, except that it seems to consistently happen (in addition to other times) at the start of the stream. Have you observed any similar performance issues between v1 and v2, or is there some v2-specific runtime configuration I may have missed?

    Expected behavior: Comparable real-time performance between releases v1 and v2.

    Screenshots: The following graphs show the results from a single test. The x-axes represent the progression of the audio file being tested, and the y-axes represent round-trip response times from wenet minus a threshold, i.e., any data point above 0 indicates additional round-trip latency beyond an acceptable threshold (in my case, 500 ms). As you can see, in the v1 graph responses are largely generated and returned below the threshold time (with the exception of a few final-marked transcripts). In the v2 graph, however, there are several lengthy periods during which responses take an unusually long time to return (I've capped the graph at 2 s for clearer viewing, but in reality responses take up to 20 s to return).

    Wenet v1: [latency graph]

    Wenet v2: [latency graph]

    Additional context: Both tests were run with wenet hosted via AWS ECS/EC2. So far as I've seen, increasing the CPU and memory allocated to the wenet container doesn't resolve the issue.

    opened by kangnari 7
  • [transducer] decoding strategy

    [transducer] decoding strategy

    Opening this to track different decoding strategies for the transducer model. I will post materials and code snippets here. Suggestions are welcome, and I hope to collaborate.

    enhancement help wanted 
    opened by Mddct 4
Releases (v2.0.1)
  • v2.0.1 (Jun 21, 2022)

  • v2.0.0 (Jun 14, 2022)

    The following features are stable.

    • [x] U2++ framework for better accuracy
    • [x] n-gram + WFST language model solution
    • [x] Context biasing (hotword) solution
    • [x] Support for training with very large datasets using UIO
    • [x] More dataset support, including WenetSpeech, GigaSpeech, HKUST, and so on.
  • v1.0.0 (Jun 21, 2021)

    Model

    • propose and support U2++ which, as the following figure shows, uses both forward and backward information in training and decoding.

    [figure: U2++ architecture]

    • support dynamic left chunk training and decoding, so we can limit the history chunks used at decoding time to save memory and computation (see the mask sketch below).
    • support distributed training.
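
    For intuition, here is a small Python sketch of such a limited-left-context chunk mask (assumed semantics for illustration; it mirrors, but is not copied from, wenet's mask utilities):

    import torch

    def chunk_mask(size: int, chunk_size: int, num_left_chunks: int) -> torch.Tensor:
        # Frame i may attend to its own chunk plus at most num_left_chunks chunks
        # of history, which bounds memory and computation at decoding time.
        mask = torch.zeros(size, size, dtype=torch.bool)
        for i in range(size):
            start = max((i // chunk_size - num_left_chunks) * chunk_size, 0)
            end = min((i // chunk_size + 1) * chunk_size, size)
            mask[i, start:end] = True
        return mask

    print(chunk_mask(8, chunk_size=2, num_left_chunks=1).int())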

    Dataset

    Now we support the following five standard speech datasets, and we achieved SOTA or close-to-SOTA results:

    | Dataset     | Language | Data (h) | Test set   | CER/WER | SOTA           |
    |-------------|----------|----------|------------|---------|----------------|
    | aishell-1   | Chinese  | 200      | test       | 4.36    | 4.36 (WeNet)   |
    | aishell-2   | Chinese  | 1000     | test_ios   | 5.39    | 5.39 (WeNet)   |
    | multi-cn    | Chinese  | 2385     | /          | /       | /              |
    | librispeech | English  | 1000     | test_clean | 2.66    | 2.10 (ESPnet)  |
    | gigaspeech  | English  | 10000    | test       | 11.0    | 10.80 (ESPnet) |

    Productivity

    Here are some features related to productivity.

    • LM support. Here is the system design for LM support. WeNet can work with or without an LM depending on your application/scenario.

    [figure: LM system design]

    • timestamp support.
    • n-best support.
    • endpoint support.
    • gRPC support.
    • further refined x86 server and on-device Android recipes.
  • v0.1.0 (Feb 4, 2021)
