kaldi-asr/kaldi is the official location of the Kaldi project.


Kaldi Speech Recognition Toolkit

To build the toolkit: see ./INSTALL. These instructions are valid for UNIX systems including various flavors of Linux; Darwin; and Cygwin (has not been tested on more "exotic" varieties of UNIX). For Windows installation instructions (excluding Cygwin), see windows/INSTALL.

To run the example system builds, see egs/README.txt

If you encounter problems (and you probably will), please do not hesitate to contact the developers (see below). In addition to specific questions, please let us know if there are specific aspects of the project that you feel could be improved, that you find confusing, etc., and which missing features you most wish it had.

Kaldi information channels

For HOT news about Kaldi see the project site.

Documentation of Kaldi:

  • Info about the project, description of techniques, tutorial for C++ coding.
  • Doxygen reference of the C++ code.

Kaldi forums and mailing lists:

We have two different lists

  • User list kaldi-help
  • Developer list kaldi-developers

To sign up to either of those mailing lists, go to http://kaldi-asr.org/forums.html.

Development pattern for contributors

  1. Create a personal fork of the main Kaldi repository in GitHub.
  2. Make your changes in a named branch different from master, e.g. you create a branch my-awesome-feature.
  3. Generate a pull request through the Web interface of GitHub.
  4. As a general rule, please follow the Google C++ Style Guide. There are a few exceptions in Kaldi. You can use Google's cpplint.py to verify that your code is free of basic mistakes.

Platform specific notes

PowerPC 64bits little-endian (ppc64le)


  • Kaldi supports cross compiling for Android using Android NDK, clang++ and OpenBLAS.
  • See this blog post for details.
  • show L2 norm of parameters during training.

    In addition, set affine to false for batchnorm layers and switch to SGD optimizer.

    The training is still running and a screenshot of the L2-norms of the training parameters is as follows:

    [Screenshot: L2 norms of the training parameters]

    I will post the decoding results once it is done.

    opened by csukuangfj 67
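
The quantity tracked in the item above can be illustrated with a minimal sketch. It assumes the parameters are available as plain lists of floats keyed by name; the names and values below are made up for illustration. In a PyTorch training loop the same numbers would come from the model's parameter tensors instead.

```python
import math


def l2_norms(params):
    """Return the L2 norm of each named parameter (a flat list of floats)."""
    return {name: math.sqrt(sum(v * v for v in values))
            for name, values in params.items()}


# Hypothetical parameters, for illustration only.
params = {
    "tdnn1.weight": [3.0, 4.0],
    "tdnn1.bias": [0.0, 0.0],
}
norms = l2_norms(params)
for name, norm in norms.items():
    print(f"{name}: {norm:.4f}")
```

Logging these values once per iteration is enough to see whether any layer's parameters grow or collapse during training.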
  • Wake-word detection

    Results of the regular LF-MMI based recipes:

    Mobvoi: EER=~0.2%, FRR=1.02% at FAH=1.5 vs. FRR=3.8% at FAH=1.5 (Mobvoi paper)

    SNIPS: EER=~0.1%, FRR=0.08% at FAH=0.5 vs. FRR=0.12% at FAH=0.5 (SNIPS paper)

    E2E LF-MMI recipes are still being run to confirm the reproducibility of the previous results.

    opened by freewym 67
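
For readers unfamiliar with the metrics quoted in the wake-word item above, here is a small sketch of how FRR and FAH are commonly computed. The definitions are the usual ones and are an assumption here, since the item does not spell them out; the counts are made up.

```python
def false_rejection_rate(num_missed, num_positive):
    """Fraction of wake-word utterances that were not detected."""
    return num_missed / num_positive


def false_alarms_per_hour(num_false_alarms, negative_hours):
    """Number of false triggers per hour of non-wake-word audio."""
    return num_false_alarms / negative_hours


# Made-up counts, chosen only to illustrate the arithmetic.
frr = false_rejection_rate(num_missed=5, num_positive=500)
fah = false_alarms_per_hour(num_false_alarms=30, negative_hours=20.0)
print(f"FRR={frr:.2%} at FAH={fah:.1f}")
```

Reporting FRR at a fixed FAH, as the results above do, compares systems at the same operating point.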
  • Multilingual using modified configs

    This is a modified multilingual setup based on the new xconfig and training scripts. In this setup, xconfig is used to create the network configuration for multilingual training. The egs generation has also been moved out of the training script, and the multilingual egs dir is passed to train_raw_dnn.py. A new script has been added for average posterior computation and prior adjustment.

    opened by pegahgh 65
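
The "average posterior computation and prior adjustment" step mentioned above can be sketched as follows. This assumes the usual hybrid-system convention of estimating class priors by averaging posteriors over frames and then subtracting the log prior from each frame's log posteriors; the PR's script may differ in detail, and the function names are hypothetical.

```python
import math


def average_posterior(posteriors):
    """Average per-frame posteriors (rows) into one prior vector."""
    num_frames = len(posteriors)
    dim = len(posteriors[0])
    return [sum(row[j] for row in posteriors) / num_frames for j in range(dim)]


def adjust_by_prior(log_posteriors, prior, floor=1e-20):
    """Subtract the log prior from one frame of log posteriors."""
    return [lp - math.log(max(p, floor)) for lp, p in zip(log_posteriors, prior)]


# Two made-up frames over two classes.
frames = [[0.5, 0.5], [0.75, 0.25]]
prior = average_posterior(frames)
scaled = adjust_by_prior([math.log(p) for p in frames[1]], prior)
print(prior, scaled)
```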
  • CUDA context creation problem in nnet3 training with "--use-gpu=wait" option

    I am not sure if this is a Kaldi issue but I thought someone might have an idea.

    First some context. I am trying to tune a few TDNN chain models on a workstation with 2 Maxwell Titan X 12GB cards. The data sets I am working with are fairly small (Babel full language packs with 40-80 hours audio). Initially I set the number of initial and final training jobs to 2 and trained the models with scripts adapted from babel and swbd recipes. While this worked without any issues, I noticed that the models were overtraining, so I tried tuning relu-dim, number of epochs and xent-regularize with one of the language packs to see if I could get a better model. Eventually the best model I got was with a single epoch and xent-regularize=0.25 (WER base model: 45.5% vs best model: 41.4%). To see if the training schedule might have any further effects on the model performance, I also tried training with --num-jobs-initial=2, --num-jobs-final=8 after setting the GPUs to "default" compute mode to allow the creation of multiple CUDA contexts. I added 2 seconds delay between individual jobs so that earlier jobs would start allocating device memory before a new job is scheduled on the device with the largest amount of free memory. This mostly worked fine, except towards the end when 8 jobs were distributed 5-3 between the two cards. The resulting model had 40.9% WER after 2 epochs and the log probability difference between the train and validation sets was also smaller than before. It seems like the training schedule (number of jobs, learning rate, etc. at each iteration) has an effect on the model performance in this small data scenario. Maybe averaging gradients across a larger number of jobs is beneficial, or the learning rate schedule is somehow tuned for this type of training schedule.

    Now the actual problem. Since a larger number of jobs seemed to work better for me, I wanted to remove the job delay hack, set the GPUs back to "exclusive process" compute mode and take advantage of the --use-gpu=wait option while scheduling the training jobs. However, it seems like I am missing something. If I launch multiple training processes with the --use-gpu=wait option while the GPUs are in "exclusive process" compute mode, only one process can create a CUDA context on a given GPU card, even after that one process completes. My expectation was that the other processes would wait for the GPUs to become available and then, one by one, acquire the GPUs and complete their work. I added a few debug statements to the GetCudaContext function to see what the problem was. The cudaDeviceSynchronize call returns "all CUDA-capable devices are busy or unavailable" even after the processes running on the GPUs are long gone. Any ideas?

    opened by dogancan 63
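
The training schedule discussed in the item above ramps the number of parallel jobs from --num-jobs-initial to --num-jobs-final. A minimal sketch of such a ramp follows, assuming simple linear interpolation over iterations; the function name, the rounding rule and the iteration count of 10 are assumptions for illustration and are not taken from the post.

```python
def num_jobs_at(iteration, num_iters, num_jobs_initial, num_jobs_final):
    """Linearly interpolate the job count for a 0-based iteration index."""
    span = num_jobs_final - num_jobs_initial
    return int(0.5 + num_jobs_initial + span * iteration / num_iters)


# Ramp from 2 to 8 jobs over a hypothetical 10 iterations.
schedule = [num_jobs_at(i, 10, 2, 8) for i in range(10)]
print(schedule)
```

With only two GPUs, the later iterations in such a schedule place several jobs on each card, which is the situation the post describes.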
  • Modify TransitionModel for more compact chain-model graphs

    Placeholder for addressing #1031. WIP log:

    1. self_loop_pdf_class added to HmmState, done
    2. self_loop_pdf added to Tuple in TransitionModel. done
    3. another branch of ContextDependencyInterface::GetPdfInfo. ugly done
    4. create test code for new structures. done
    5. backward compatibility for all read code. done
    6. normal HMM validation using RM. done
    7. chain code modification. done
    8. chain validation using RM. done
    9. iterate 2nd version of GetPdfInfo. done
    10. documents and comments. tbd...
    opened by naxingyu 63
  • add PyTorch's DistributedDataParallel training.

    support distributed training across multiple GPUs.


    • there are lots of code duplicates

    Part of the training log

    2020-02-19 13:55:10,646 INFO [ddp_train.py:160] Device (1) processing 1100/4724(23.285351%) global average objf: -0.225449 over 6165760.0 frames, current batch average objf: -0.130735 over 6400 frames, epoch 0
    2020-02-19 13:55:55,251 INFO [ddp_train.py:160] Device (0) processing 1200/4724(25.402202%) global average objf: -0.216779 over 6732672.0 frames, current batch average objf: -0.123979 over 3840 frames, epoch 0
    2020-02-19 13:55:55,252 INFO [ddp_train.py:160] Device (1) processing 1200/4724(25.402202%) global average objf: -0.216412 over 6738176.0 frames, current batch average objf: -0.132368 over 4736 frames, epoch 0

    The training seems to be working.

    opened by csukuangfj 62
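
To follow progress across devices, log lines like the ones quoted above can be parsed with a regular expression. The sketch below is based only on the line layout shown in the excerpt; the function name is hypothetical, and the pattern may need adjusting if the real log format varies.

```python
import re

# Pattern derived from the log excerpt above; other log formats are not handled.
LOG_PATTERN = re.compile(
    r"Device \((?P<device>\d+)\) processing (?P<batch>\d+)/(?P<total>\d+)"
    r"\((?P<percent>[\d.]+)%\) global average objf: (?P<global_objf>-?[\d.]+)"
)


def parse_progress(line):
    """Extract device id, batch progress and global objf from one log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "device": int(match["device"]),
        "batch": int(match["batch"]),
        "total": int(match["total"]),
        "global_objf": float(match["global_objf"]),
    }


line = ("2020-02-19 13:55:10,646 INFO [ddp_train.py:160] Device (1) processing "
        "1100/4724(23.285351%) global average objf: -0.225449 over 6165760.0 frames, "
        "current batch average objf: -0.130735 over 6400 frames, epoch 0")
record = parse_progress(line)
print(record)
```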
  • Is there any speaker diarization documentation and already trained model?

    Hi there, thanks for Kaldi :)

    I want to perform speaker diarization on a set of audio recordings. I believe Kaldi recently added the speaker diarization feature. I have managed to find this link; however, I have not been able to figure out how to use it since there is very little documentation. Also, is there an already trained model for conversations in English that I can use off the shelf?

    Thanks a lot!

    opened by bwang482 61
  • expose egs as Dataloader

    Expose egs as a Dataloader in PyTorch, training time now decreased from 150mins to 90mins for 6 epochs with 4 workers.


    | | TDNN-F (Pytorch, Adam, delta dropout without ivector) from @fanlu | TDNN-F (Pytorch, Adam, delta dropout without ivector) this PR 2nd run | TDNN-F (Pytorch, Adam, delta dropout without ivector) this PR 1st run | this PR with commit 0d8aada to make dropout go to zero at the end |
    |--|--|--|--|--|
    | dev_cer | 6.10 | 6.13 | 6.18 | 6.12 |
    | dev_wer | 13.86 | 13.89 | 13.96 | 13.92 |
    | test_cer | 7.14 | 7.19 | 7.20 | 7.26 |
    | test_wer | 15.49 | 15.54 | 15.66 | 15.63 |
    | training_time | 151mins | 88mins | 84mins | |

    WER/CER increase may come from:

    • Shuffle: we do not shuffle egs minibatches during each epoch.
    • Dropout: we use pseudo_epoch (one scp file is one pseudo-epoch) to compute data_fraction in dropout, which is much more coarse-grained than using batch_idx.

    Note that I have tried copy|shuffle|merge in the dataloader (see the code below), but it seems to take as much time as (or even a little more than) the original approach (egs as a Dataset). I may do further experiments to look into this:

    # Pipe one scp file through copy | shuffle | merge.
    # The ".." option values are elided in the original post.
    scp_rspecifier = scp_file_to_process
    egs_rspecifier = (
        f'ark,bg:nnet3-chain-copy-egs --frame-shift .. scp:{scp_rspecifier} ark:- | '
        'nnet3-chain-shuffle-egs --buffer-size .. --srand .. ark:- ark:- | '
        'nnet3-chain-merge-egs --minibatch-size .. ark:- ark:- |'
    )
    with SequentialNnetChainExampleReader(egs_rspecifier) as example_reader:
        for key, eg in example_reader:
            batch = self.collate_fn(eg)
            yield pseudo_epoch, batch


    • [ ] Split egs to more scp files (currently 56 files) to see whether it will get fine-grained in data_fraction of dropout or not.
    • [ ] Do further experiment and trace for approach copy|shuffle|merge in dataloader to confirm the bottleneck of this approach.
    • [ ] Profile first epoch of training to see why it takes so much time, as we can see now that first epoch training would take most part of time of the whole training time, no matter what approach (egs as dataset or dataloader) we use.
    opened by qindazhu 58
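
The granularity concern about data_fraction raised in the item above can be made concrete with a small sketch. It assumes data_fraction is simply the fraction of training completed so far; both function names are hypothetical, and the numbers (6 epochs, 56 scp files) are the ones quoted in the item.

```python
def data_fraction_by_pseudo_epoch(epoch, scp_idx, num_epochs, num_scp_files):
    """Progress through training, updated once per scp file (pseudo-epoch)."""
    return (epoch * num_scp_files + scp_idx) / (num_epochs * num_scp_files)


def data_fraction_by_batch(batches_seen, total_batches):
    """Progress through training, updated after every minibatch."""
    return batches_seen / total_batches


# With 6 epochs and 56 scp files, data_fraction changes in steps of 1/336.
step = (data_fraction_by_pseudo_epoch(0, 1, 6, 56)
        - data_fraction_by_pseudo_epoch(0, 0, 6, 56))
print(step)
```

Splitting the egs into more scp files, as proposed in the first TODO, shrinks this step; using the batch index makes it as fine as one minibatch.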
  • [src] CUDA Online/Offline pipelines + light batched nnet3 driver

    This is still WIP. Requires some cleaning, integrating the online mfcc into a separate PR (cf below), and some other things.

    Implementing a low-latency high-throughput pipeline designed for online. It uses the GPU decoder, the GPU mfcc/ivector, and a new lean nnet3 driver (including nnet3 context switching on device).

    • Online/Offline pipelines

    The online pipeline can be seen as taking a batch as input and then running a very regular algorithm on that same batch: feature extraction, nnet3, decoder, and postprocessing, all in a synchronous fashion (i.e. all of those steps run when DecodeBatch is called; nothing is sent to asynchronous pipelines along the way). Because what happens when you run DecodeBatch is so regular, the code executes in a very predictable way, which makes it possible to guarantee some latency constraints. It also focuses on being lean, avoiding reallocations or recomputations (such as recompiling nnet3).

    The online pipeline takes care of computing [MFCC, iVectors], nnet3, decoder, and postprocessing. It can either take chunks of raw audio as input (and then compute mfcc->nnet3->decoder->postprocessing), or it can be called directly with mfcc features/ivectors (and then compute nnet3->decoder->postprocessing). The second possibility is used by the offline wrapper when use_online_ivectors=false.

    The old offline pipeline is replaced by a new offline pipeline, which is mostly a wrapper around the online pipeline. It provides an offline-friendly API (accepting full utterances as input instead of chunks) and can pre-compute ivectors on the full utterance first (use_online_ivectors = false). It then calls the online pipeline internally to do most of the work.

    The easiest way to test the online pipeline end-to-end is to call it through the offline wrapper for now, with use_online_ivectors = true. Please note that ivectors will be ignored for now in this full end-to-end online mode (i.e. when use_online_ivectors=true). That's because the GPU ivectors are not yet ready for online. However, the pipeline code is ready. The offline pipeline with use_online_ivectors=false should be fully functional and return the same WER as before.

    • Light nnet3 driver designed for GPU and online

    It includes a new light nnet3 driver designed for the GPU. The key idea is that it's usually better to waste some flops on computing things such as partial chunks or partial batches. For example, the last chunk of an utterance (say nframes=17) can be smaller than max_chunk_size (50 frames by default). In that case, compiling a new nnet3 computation for that exact chunk size is slower than just running it with a chunk size of 50 and ignoring the invalid output.

    The same idea applies to batch_size: the nnet3 computation always runs with a fixed minibatch size, defined as minibatch_size = std::min(max_batch_size, MAX_MINIBATCH_SIZE). MAX_MINIBATCH_SIZE is defined to be large enough to hide the kernel launch latency and increase the arithmetic intensity of the GEMMs, but not larger, so that partial batches are not slowed down too much (i.e. to avoid running a minibatch of size 512 where only 72 utterances are valid). MAX_MINIBATCH_SIZE is currently 128. We then run nnet3 multiple times on the same batch if necessary: if batch_size=512, we run nnet3 (with minibatch_size=128) four times.

    The context-switch (to restore the nnet left and right context, and ivector) is done on device. Everything that needs context-switch is using the concept of channels, to be consistent with the GPU decoder.

    Those "lean" approaches gave us better performance, and a drop in memory usage (total GPU memory usage from 15GB to 4GB for librispeech and batch size 500). It also removes the need for "high level" multithreading (i.e. cuda-control-threads).
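
The fixed-size arithmetic described in this part of the item can be sketched in a few lines. The two constants are the values quoted above; the function names are hypothetical and the real driver is C++, so this is only an illustration of the bookkeeping, not of the driver itself.

```python
import math

MAX_MINIBATCH_SIZE = 128  # value quoted in the description above
MAX_CHUNK_SIZE = 50       # default frames per chunk quoted above


def nnet3_runs(batch_size, max_batch_size):
    """Return (minibatch_size, number of fixed-size nnet3 runs) for a batch."""
    minibatch_size = min(max_batch_size, MAX_MINIBATCH_SIZE)
    return minibatch_size, math.ceil(batch_size / minibatch_size)


def wasted_frames(nframes, chunk_size=MAX_CHUNK_SIZE):
    """Frames computed and then ignored when a short chunk runs at full size."""
    return chunk_size - nframes


full_batch = nnet3_runs(512, 512)
partial_batch = nnet3_runs(72, 512)
last_chunk_waste = wasted_frames(17)
print(full_batch, partial_batch, last_chunk_waste)
```

A partial batch of 72 utterances therefore costs one run of 128 rather than one run of 512, which is the trade-off the description argues for.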

    • Parameters simplification

    Dropping some parameters because the new code design doesn't require them (--cuda-control-threads, the drain size parameter). In theory the configuration should be greatly simplified (only --max-batch-size needs to be set, others are optional).

    • Adding batching and online to GPU mfcc

    The code in cudafeat/ modifies the mfcc GPU code. MFCC features can now be batched and processed online (restoring a few hundred frames of past audio for each new chunk). That code was implemented by @mcdavid109 (thanks!). We'll create a separate PR for this; it requires some cleaning, and a large part of the code is redundant with existing mfcc files. GPU batched online ivectors and cmvn are WIP.

    • Indicative measurements

    When used with use_online_ivectors=false, that code reaches 4,940 XRTF on librispeech/test_clean, with a latency around 6x realtime for max_batch_size=512 (latency would be lower with a smaller max_batch_size). One use case where only latency matters (and not throughput) is, for instance, the Jetson Nano, where some initial runs were measured at 5-10x realtime latency for a single channel (max_batch_size=1) on librispeech/clean. Those measurements are indicative only; more reliable measurements will be done in the future.

    opened by hugovbraun 56
  • Online2 NNet3 TCP server program

    Several people asked for this and I feel like it would be a nice addition to the project.

    The protocol is much simpler than the audio-server program I did a while ago - audio in -> text out.

    The way it's made now is nice for a live demo (I added some commands to the doxygen docs), but may still lack some features for real-life use.

    The main issue I have is that the new decoder is slightly different than before. The old decoder had a way to check which part of the output is final and which is "partial". This time, I can only check the current best path every N seconds (e.g. once per second of input audio). I use endpointing to determine when to finalize decoding.

    Now, what would be really nice is to have online speech detection and speaker diarization included with this, but I know it's probably not happening too soon. What can be done (and I may do it myself if I find time) is a multithreaded version of the program with a shared acoustic model and FST. Also, I bet it would be possible to combine the grammar version with the server to allow runtime vocabulary modification.

    I also have a web interface that works with this server, but I'm not sure if it would fit the main Kaldi repo, so I'll probably make a separate repo for that (if anyone wants it).

    I'm open to comments and suggestions.

    opened by danijel3 56
  • Xvectors: DNN Embeddings for Speaker Recognition

    Overview

    This pull request adds xvectors for speaker recognition. The system consists of a feedforward DNN with a statistics pooling layer. Training is multiclass cross entropy over the list of training speakers (we may add other methods in the future). After training, variable-length utterances are mapped to fixed-dimensional embeddings or “xvectors” and used in a PLDA backend. This is based on http://www.danielpovey.com/files/2017_interspeech_embeddings.pdf, but includes recent enhancements not in that paper, such as data augmentation.

    This PR also adds a new data augmentation script, which is important to achieve good performance in the xvector system. It is also helpful for ivectors (but only in PLDA training).

    This PR adds a basic SRE16 recipe to demonstrate the system. An ivector system is in v1, and an xvector system is in v2.
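
How a statistics pooling layer turns a variable-length utterance into a fixed-dimensional vector can be sketched as follows. The sketch assumes the pooled statistics are the per-dimension mean and standard deviation over frames, which is the common form; the exact statistics used in this PR are an assumption here, and the input values are made up.

```python
import math


def statistics_pooling(frames):
    """Pool a variable-length list of frame vectors into [means..., stds...]."""
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[j] for f in frames) / num_frames for j in range(dim)]
    stds = [math.sqrt(sum((f[j] - means[j]) ** 2 for f in frames) / num_frames)
            for j in range(dim)]
    return means + stds


# Two made-up utterances of different lengths, same feature dimension.
short = statistics_pooling([[1.0, 2.0], [3.0, 2.0]])
longer = statistics_pooling([[1.0, 2.0], [3.0, 2.0], [2.0, 2.0], [2.0, 2.0]])
print(len(short), len(longer))
```

Both outputs have the same length regardless of the number of input frames, which is what lets the layers after pooling produce a single embedding per utterance.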

    Example Generation

    An example consists of a chunk of speech features and the corresponding speaker label. Within an archive, all examples have the same chunk-size, but the chunk-size varies across archives. The relevant additions:

    • sid/nnet3/xvector/get_egs.sh — Top-level script for example creation
    • sid/nnet3/xvector/allocate_egs.py — This script is responsible for deciding what is contained in the examples and what archives they belong to.
    • src/nnet3bin/nnet3-xvector-get-egs — The binary for creating the examples. It constructs examples based on the ranges.* file.

    Training

    This version of xvectors is trained with multiclass cross entropy (softmax over the training speakers). Fortunately, steps/nnet3/train_raw_dnn.py is compatible with the egs created here, so no new code is needed for training. Relevant code:

    • sre16/v1/local/nnet3/xvector/tuning/run_xvector_1a.sh — Does example creation, creates the xconfig, and trains the nnet

    Extracting XVectors

    After training, the xvectors are extracted from a specified layer of the DNN after the temporal pooling layer. Relevant additions:

    • sid/nnet3/xvector/extract_xvectors.sh — Extracts embeddings from the xvector DNN. This is analogous to extract_ivectors.sh.
    • src/nnet3bin/nnet3-xvector-compute — Does the forward computation for the xvector DNN (variable-length input, with a single output).

    Augmentation

    We’ve found that embeddings almost always benefit from augmented training data. This appears to be true even when evaluated on clean telephone speech. Relevant additions:

    • steps/data/augment_data_dir.py — Similar to reverberate_data_dir.py but only handles additive noise.
    • egs/sre16/v1/run.sh — PLDA training list is augmented with reverb and MUSAN audio
    • egs/sre16/v2/run.sh — DNN training and PLDA list are augmented with reverb and MUSAN.

    SRE16 Recipe

    The PR includes a bare bones SRE16 recipe. The goal is primarily to demonstrate how to train and evaluate an xvector system. The version in egs/sre16/v1/ is a straightforward i-vector system. The recipe in egs/sre16/v2 contains the DNN embedding recipe. Relevant additions:

    • egs/sre16/v1/local/ — A bunch of dataprep scripts
    • egs/sre16/v2/local/nnet3/xvector/prepare_feats_for_egs.sh -- A script that applies cmvn and removes silence frames and writes the results to disk. This is what the nnet examples are generated from.
    • egs/sre16/v1/run.sh — ivector top-level script
    • egs/sre16/v2/run.sh — xvector top-level script

    Results for this example:

      xvector (from v2) EER: Pooled 8.76%, Tagalog 12.73%, Cantonese 4.86%
      ivector (from v1) EER: Pooled 12.98%, Tagalog 17.8%, Cantonese 8.35%

    Note that the recipe is somewhat "bare bones." We could improve the results for the xvector system further by adding even more training data (e.g., Voxceleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/). Both systems would improve from updates to the backend such as adaptive score normalization or more effective PLDA domain adaptation techniques. However, I believe that is orthogonal to this PR.

    opened by david-ryan-snyder 55
  • Memory "leak" of cudadecoder's arc instantiations

    Hi, I have recently been trying to track down progressive memory growth in Triton's Kaldi backend (https://github.com/NVIDIA/DeepLearningExamples/issues/1240), and in pursuit of that I've successfully tried to reproduce the issue with a bare Kaldi setup.

    I don't have any understanding of Kaldi's internals, so some of the information given here might seem vague or might be outright nonsensical, but I hope it gives the general idea.

    Basically, the issue seems to be that the cudadecoder keeps max_active number of arc instantiations per every computation of an audio chunk (frame computation), that never seems to get freed until the decoder's destructor is called.

    As far as I can understand logically, the arc instantiations are relevant only for a given correlation id / audio stream, and there is no meaningful way to use these instantiations to improve the accuracy of other, unrelated audio streams / correlation ids. So, it seems fair to expect that all arc instantiations relating to a given correlation ID get freed once the last chunk for that ID has been processed. However this doesn't seem to happen practically.

    This becomes a huge problem in the Triton Kaldi backend since it constantly takes in new inputs from clients, and the memory usage climbs rapidly with every inference (reaching up to 30G for large WAVs)

    Steps to reproduce:

    Use this shell script to launch an inference for the LibriSpeech dataset:

    # --max-active=10
    /bin/time -v ./batched-wav-nnet3-cuda-online \
        --max-batch-size=1100 \
        --cuda-use-tensor-cores=true \
        --cuda-worker-threads=12 \
        --cuda-decoder-copy-threads=4 \
        --print-hypotheses \
        --cuda-use-tensor-cores=true \
        --main-q-capacity=30000 \
        --aux-q-capacity=400000 \
        --beam=10 \
        --cuda-worker-threads=10 \
        --num-channels=4000 \
        --lattice-beam=7 \
        --max-active=10000 \
        --frames-per-chunk=50 \
        --acoustic-scale=1.0 \
        --config=/data/models/LibriSpeech/conf/online.conf \
        --word-symbol-table=/data/models/LibriSpeech/words.txt \
        /data/models/LibriSpeech/final.mdl \
        /data/models/LibriSpeech/HCLG.fst \
        scp:/data/datasets/LibriSpeech/test_clean/wav_conv.scp \
        'ark:|gzip -c > /tmp/lat.gz'

    Notice that the memory usage keeps climbing and remains constant after all the inferences have been performed. It only gets freed once the whole decoder object is destroyed. The expected behaviour is that the memory usage keeps fluctuating up and down as a consequence of properly releasing the memory for the arc instantiations of the correlation IDs that have been completely inferred.

    The program's memory usage caps out at around 6G in case of max_active=10:

    Command being timed: "./batched-wav-nnet3-cuda-online --max-batch-size=1100 ... **--max-active=10** ... ark:|gzip -c > /tmp/lat.gz"
            User time (seconds): 30.66
            Maximum resident set size (kbytes): **5989872**

    I'm showing Maximum resident set size (i.e. the peak memory usage) because the usage actually never goes down after peaking due to the leak. This can be confirmed by adding a sleep before return-ing here: https://github.com/kaldi-asr/kaldi/blob/master/src/cudadecoderbin/batched-wav-nnet3-cuda-online.cc#L316

    And at 8G in case of max_active=10000:

    Command being timed: "./batched-wav-nnet3-cuda-online --max-batch-size=1100 ... **--max-active=10000** ... ark:|gzip -c > /tmp/lat.gz"             
            User time (seconds): 29.87
            Maximum resident set size (kbytes): **8204936**

    This correlation between the memory usage and the value of max_active led me to believe that the arc instantiations are not being freed as soon as a given correlation ID's last chunk has been processed.

    opened by git-bruh 5
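
The peak-memory comparison in the item above can be reproduced from saved `/bin/time -v` output with a short script. This is a sketch that matches only the "Maximum resident set size (kbytes)" line shown in the report; the helper names are hypothetical.

```python
import re


def max_rss_kbytes(time_output):
    """Return the peak resident set size reported by `/bin/time -v`, in kbytes."""
    match = re.search(r"Maximum resident set size \(kbytes\): (\d+)", time_output)
    return int(match.group(1)) if match else None


def kbytes_to_gib(kbytes):
    """Convert kbytes to GiB."""
    return kbytes / (1024 ** 2)


# The two values quoted in the report above.
low = max_rss_kbytes("Maximum resident set size (kbytes): 5989872")
high = max_rss_kbytes("Maximum resident set size (kbytes): 8204936")
print(f"max_active=10:    {kbytes_to_gib(low):.2f} GiB")
print(f"max_active=10000: {kbytes_to_gib(high):.2f} GiB")
print(f"difference:       {kbytes_to_gib(high - low):.2f} GiB")
```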
  • Error: in data/train, utterance-ids extracted from utt2spk and utt2dur file

    Can anyone help me with this error? It appears when I execute the run.sh file:


    steps/make_mfcc.sh --nj 1 --cmd run.pl data/train exp/make_mfcc/train mfcc
    utils/validate_data_dir.sh: Error: in data/train, utterance-ids extracted from utt2spk and utt2dur file
    utils/validate_data_dir.sh: differ, partial diff is:
    --- /tmp/kaldi.G3YQ/utts 2022-12-15 15:47:53.033696862 +0000
    +++ /tmp/kaldi.G3YQ/utts.utt2dur 2022-12-15 15:47:53.053696720 +0000
    @@ -2,28 +2,4 @@
     spk1_10
    -spk1_100
    -spk1_101
    ...
    [Lengths are /tmp/kaldi.G3YQ/utts=435 versus /tmp/kaldi.G3YQ/utts.utt2dur=142]
    steps/make_mfcc.sh --nj 1 --cmd run.pl data/test exp/make_mfcc/test mfcc
    utils/validate_data_dir.sh: Error: in data/test, utterance-ids extracted from utt2spk and utt2dur file
    utils/validate_data_dir.sh: differ, partial diff is:
    --- /tmp/kaldi.uq6T/utts 2022-12-15 15:47:53.089696463 +0000
    +++ /tmp/kaldi.uq6T/utts.utt2dur 2022-12-15 15:47:53.105696349 +0000
    @@ -2,28 +2,4 @@
     spk1_10
    -spk1_100
    -spk1_101
    ...
    [Lengths are /tmp/kaldi.uq6T/utts=435 versus /tmp/kaldi.uq6T/utts.utt2dur=142]
    steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
    steps/compute_cmvn_stats.sh: no such file data/train/feats.scp
    steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test mfcc
    steps/compute_cmvn_stats.sh: no such file data/test/feats.scp

    opened by ShakeelOfficials 0
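
The mismatch reported above (435 utterance ids from utt2spk versus 142 from utt2dur) can be inspected with a small script that compares the first field of each line in the two files. This is a sketch assuming the usual "<utt-id> <value>" line layout; the helper names and sample lines are made up.

```python
def utterance_ids(lines):
    """Collect the first field (utterance id) of each non-empty line."""
    return {line.split()[0] for line in lines if line.strip()}


def compare_ids(utt2spk_lines, utt2dur_lines):
    """Return ids missing from utt2dur and ids missing from utt2spk."""
    spk_ids = utterance_ids(utt2spk_lines)
    dur_ids = utterance_ids(utt2dur_lines)
    return sorted(spk_ids - dur_ids), sorted(dur_ids - spk_ids)


# Tiny made-up example in the "<utt-id> <value>" layout.
utt2spk = ["spk1_10 spk1", "spk1_100 spk1", "spk1_101 spk1"]
utt2dur = ["spk1_10 3.2"]
missing_from_utt2dur, missing_from_utt2spk = compare_ids(utt2spk, utt2dur)
print(missing_from_utt2dur, missing_from_utt2spk)
```

On real data, the lists can be built by reading data/train/utt2spk and data/train/utt2dur line by line.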
  • Faster Cuda Decoder

    There were several issues recently discovered with the cuda decoder in both offline and online mode.

    After my fixes, I can achieve 7800 RTFx throughput on librispeech test-clean and the model https://kaldi-asr.org/models/m13 with an A100-80GB PCIe card in the offline mode of computation. Previously, because of some unnoticed software regressions, this number was as low as 4000 RTFx, which isn't bad, admittedly.

    Latency is more complicated, but here is a preliminary result with this model https://kaldi-asr.org/models/m13 on librispeech test-clean:


    This was achieved via the following hyperparameter sweep:

    for chunk_size in 21 30 40 50; do
        for num_streaming_channels in 1000 2000 3000 4000 5000 6000; do
            max_batch_size=$((num_streaming_channels>4000 ? 4000 : num_streaming_channels))
            /home/dgalvez/scratch/code/asr/kaldi-a100-perf//src/cudadecoderbin/batched-wav-nnet3-cuda-online \
                --num-channels=$((num_streaming_channels * 2)) --cuda-use-tensor-cores=true --main-q-capacity=30000 \
                --aux-q-capacity=400000 --cuda-memory-proportion=0.5 --max-batch-size=$max_batch_size \
                --cuda-worker-threads=12 --file-limit=-1 --cuda-decoder-copy-threads=4 --batching-copy-threads=8 \
                --frame-subsampling-factor=3 --frames-per-chunk=$chunk_size --max-mem=100000000 --beam=10 \
                --lattice-beam=7 --acoustic-scale=1.0 --determinize-lattice=true --max-active=10000 \
                --iterations=10 --file-limit=-1 \
                --config=/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//conf/online.conf \
                --num-parallel-streaming-channels=$num_streaming_channels \
                --word-symbol-table=/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//words.txt \
                /home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//final.mdl \
                /home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//HCLG.fst \
                scp:/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//datasets/LibriSpeech/test_clean//wav_conv.scp \
                'ark:|gzip -c > /tmp/results/LibriSpeech/52/0/lat.gz' # 2> \
            cat output.log | grep -A 1 "Latencies" | grep -v "Latencies" | awk 'BEGIN { OFS = ","; ORS = ""} {print $3,$4,$5,$6}' >> $result_file
            echo ",${chunk_size},${num_streaming_channels},${max_batch_size}" >> $result_file
        done
    done

    Do note that better results can be achieved sometimes by setting maximum batch size lower than the number of channels. Average latency is, of course, much smaller. This means users can do real-time decoding at 3000-4000 audio streams concurrently.

    This is the "compute" latency. It doesn't include the time spent waiting for the right hand context (21 frames, or 210 ms in this case). The point is that it is incredibly fast.

    opened by galv 11
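
The two figures used in the item above, RTFx throughput and the wait for right-context frames, follow from simple arithmetic. In the sketch below the 10 ms frame shift is inferred from "21 frames, or 210 ms"; the function names and the example timings are made up.

```python
def rtfx(audio_seconds, wall_clock_seconds):
    """Throughput as seconds of audio decoded per second of wall-clock time."""
    return audio_seconds / wall_clock_seconds


def context_wait_ms(right_context_frames, frame_shift_ms=10):
    """Time spent waiting for right-context frames before a chunk can be computed."""
    return right_context_frames * frame_shift_ms


wait = context_wait_ms(21)
throughput = rtfx(audio_seconds=7800.0, wall_clock_seconds=1.0)
print(wait, throughput)
```

The compute latency reported above excludes this context wait, so the latency a user sees per chunk is roughly the sum of the two.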
  • SRILM: allow bypassing download/extraction during automated installation

    The SRILM website download procedure seems to have been broken for a while. This PR allows you to bypass downloading the archive from the SRI website and/or extracting the archive into the source tree (if you have either obtained via other means), while still taking advantage of the rest of the automated installation script from Kaldi.

    opened by daanzu 0
A state-of-the-art automatic speech recognition toolkit