Real-time monaural source separation based on a fully convolutional neural network operating in the time-frequency domain.


An AI source separator written in C that runs a U-Net model trained by Deezer. It separates your audio input into Drum, Bass, Accompaniment, and Vocal/Speech stems using the Spleeter model.

Build Instructions


  • Visual Studio 2019
  • Intel MKL library - 2019 Update 5
  • JUCE 6.x (but any version should work)
  • Extract model.7z in place (for the offline executable)

Run git clone

Open the .projucer file.


Network overview

The network accepts a 2-channel magnitude spectrogram as input. The U-Net is built from 6 encoder/decoder pairs, and a final dilated convolution layer expands the second-to-last feature map into 2 channels for stereo inference.

4-stem separation requires 4 networks, one per stem. Each network computes a probability mask as its final output.

The encoder uses convolutional layers with stride = 2, which removes the need for max pooling, a great improvement for a real-time system.

Batch normalization and activation follow the output of each convolution layer, except at the bottleneck of the U-Net.

The decoder uses transposed convolutions with stride = 2 for upsampling, with their inputs concatenated with the output of the matching encoder Conv2D layer.

Note that the tensors we concatenate are not the batch-normalized, activated outputs of the encoder layers. The decoder side concatenates only the raw convolution output of each encoder layer.

Fast neural network inference

Deep learning inference is all about GEMM. We have to implement an im2col() function with stride, padding, and dilation that can handle TensorFlow-style CNNs or even PyTorch-style convolutional layers.
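To make the im2col-plus-GEMM idea concrete, here is a minimal sketch, not the repository's actual code. The names `im2col_sketch` and `gemm_sketch`, the channel-major layout, and the symmetric zero padding are all assumptions made for illustration. TensorFlow's SAME padding can be asymmetric, so a real implementation would need separate begin/end padding.

```c
/* Output length of a convolution along one axis (symmetric padding assumed). */
static int conv_len(int in, int k, int stride, int pad, int dil)
{
    return (in + 2 * pad - dil * (k - 1) - 1) / stride + 1;
}

/* Unroll a C x H x W input (channel-major) into a matrix with
   C*kh*kw rows and out_h*out_w columns. Out-of-range taps read as zero. */
void im2col_sketch(const float *in, int C, int H, int W,
                   int kh, int kw, int stride, int pad, int dil,
                   float *col)
{
    int out_h = conv_len(H, kh, stride, pad, dil);
    int out_w = conv_len(W, kw, stride, pad, dil);
    int row = 0;
    for (int c = 0; c < C; c++)
        for (int i = 0; i < kh; i++)
            for (int j = 0; j < kw; j++, row++)
                for (int oy = 0; oy < out_h; oy++)
                    for (int ox = 0; ox < out_w; ox++) {
                        int y = oy * stride - pad + i * dil;
                        int x = ox * stride - pad + j * dil;
                        float v = 0.0f;
                        if (y >= 0 && y < H && x >= 0 && x < W)
                            v = in[(c * H + y) * W + x];
                        col[(row * out_h + oy) * out_w + ox] = v;
                    }
}

/* Naive row-major GEMM: out[M x N] = a[M x K] * b[K x N].
   A tuned BLAS routine would replace this loop in practice. */
void gemm_sketch(const float *a, const float *b, float *out,
                 int M, int K, int N)
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += a[m * K + k] * b[k * N + n];
            out[m * N + n] = acc;
        }
}
```

With the weights stored as an (output channels) x (C*kh*kw) matrix, one `gemm_sketch` call on the unrolled input produces every output channel of the layer at once.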

We also need a col2im() function with stride and padding for the transposed convolutional layers.
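A transposed convolution can be computed as a GEMM (transposed weights times input) followed by a scatter-add of the resulting columns into the larger output. The sketch below shows only that scatter-add step. The name `col2im_sketch`, the argument order, and the channel-major layout are assumptions for illustration, not the repository's API.

```c
/* Scatter-add a column matrix with C*kh*kw rows and in_h*in_w columns
   into a C x out_h x out_w output. Contributions that land in the
   padding region are dropped. */
void col2im_sketch(const float *col, int C, int in_h, int in_w,
                   int kh, int kw, int stride, int pad,
                   int out_h, int out_w, float *out)
{
    for (int n = 0; n < C * out_h * out_w; n++)
        out[n] = 0.0f;

    int row = 0;
    for (int c = 0; c < C; c++)
        for (int i = 0; i < kh; i++)
            for (int j = 0; j < kw; j++, row++)
                for (int iy = 0; iy < in_h; iy++)
                    for (int ix = 0; ix < in_w; ix++) {
                        int y = iy * stride - pad + i;
                        int x = ix * stride - pad + j;
                        if (y >= 0 && y < out_h && x >= 0 && x < out_w)
                            out[(c * out_h + y) * out_w + x] +=
                                col[(row * in_h + iy) * in_w + ix];
                    }
}
```

When the stride equals the kernel size, the scattered patches do not overlap; with a smaller stride, overlapping taps accumulate, which is why the output must be zeroed first.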

After constructing the model in C, the test run shows promising performance: we process a 14-second song within 600 ms of wall-clock time. The numeric accuracy is about 1e-4 MSE against the TensorFlow model, which indicates that the architecture is correct.

I don't plan to use libtensorflow, I'll explain why.

Deep learning functions in existing code: im2col(), col2im(), gemm(), conv_out_dim(), transpconv_out_dim()
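The two shape helpers named above can be sketched with the standard convolution size formulas. The signatures below are hypothetical (hence the `_sketch` suffix), and the kernel size of 5 and padding of 2 used in the note afterwards are assumptions, not values taken from this document.

```c
/* Output size of a strided, padded, dilated convolution along one axis. */
int conv_out_dim_sketch(int in, int kernel, int stride, int pad, int dilation)
{
    return (in + 2 * pad - dilation * (kernel - 1) - 1) / stride + 1;
}

/* Output size of a transposed convolution along one axis.
   output_pad resolves the ambiguity when several input sizes of the
   forward convolution map to the same output size. */
int transpconv_out_dim_sketch(int in, int kernel, int stride, int pad,
                              int dilation, int output_pad)
{
    return (in - 1) * stride - 2 * pad + dilation * (kernel - 1)
           + output_pad + 1;
}
```

Under those assumptions, a stride-2 layer maps 1024 bins to 512, and six such layers reduce 1024 to 16. The matching transposed layer needs `output_pad = 1` to map 512 back to 1024.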

We have to initialize a block of memory and spawn some threads before processing begins. Developers can adjust the number of frequency bins and time frames that the neural network runs inference on. The official Spleeter sets FFTLength = 4096, Flim = 1024, and T = 512 for the default CNN input, so the neural network predicts a mask up to 11 kHz and spans about 10 seconds of audio.

Real time system design for the VST

This means the real-world latency of the default setting with the official model is 11 seconds plus the overlap-add sample latency. No matter how fast your CPU gets, this sample latency is intrinsic.

I decided to reduce the stack of time-frequency frames to a quarter of the original, that is, to change T to 128, at the cost of a slightly less accurate result.
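The release notes further down quote the intrinsic latency as ((F / Lap) * T * BufFactor) / Fs. A small helper makes it easy to try other settings. The function name and parameter names are made up for this sketch, and Lap = 4 and Fs = 44100 are the values used in those release notes.

```c
/* Intrinsic latency in seconds: ((F / Lap) * T * BufFactor) / Fs.
   fft_len    = F, the FFT length
   overlap    = Lap, the overlap factor (hop size = F / Lap)
   frames     = T, the number of time frames fed to the CNN
   buf_factor = 1 for a single buffer, 2 for a double-buffered design
   fs         = sample rate in Hz */
double intrinsic_latency_sec(int fft_len, int overlap, int frames,
                             int buf_factor, double fs)
{
    double hop = (double)fft_len / overlap;
    return hop * frames * buf_factor / fs;
}
```

With F = 4096, Lap = 4 and Fs = 44100, T = 512 with a single buffer gives about 11.89 s, while T = 128 gives about 2.97 s with a single buffer and about 5.94 s with a buffer factor of 2.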

However, this only reduces the sample latency; it doesn't solve the fact that our system lags, because each deep learning function call costs 600 ms on my system. It is as if we stop the audio pipeline for 600 ms to wait for the CNN.

OK, so we have to go for a double-buffered design.

The 2D image buffering mechanism is double-buffered: we collect 128 frames and output the last block (frames 128-256) while the background threads compute frames 1-128. We trade sample latency for computation time, which results in 6 seconds of latency, still lower than the official default setting.
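The buffering idea can be sketched as a ping-pong structure: one half is filled with incoming frames while the other half is free to be processed. This is a single-threaded illustration with tiny, made-up sizes and hypothetical names. It leaves out the thread handoff and synchronization that a real plugin needs.

```c
#include <string.h>

#define SKETCH_BINS   4  /* frequency bins per frame (tiny, for illustration) */
#define SKETCH_FRAMES 3  /* frames per half; the text above uses 128 */

typedef struct {
    float buf[2][SKETCH_FRAMES][SKETCH_BINS];
    int write_half;  /* half currently being filled: 0 or 1 */
    int write_pos;   /* next free frame slot in that half */
} PingPong;

void pingpong_init(PingPong *p)
{
    memset(p, 0, sizeof *p);
}

/* Store one frame. Returns the index of the half that has just become
   full and may be handed to the background workers, or -1 if the
   current half is still filling. */
int pingpong_push(PingPong *p, const float *frame)
{
    memcpy(p->buf[p->write_half][p->write_pos], frame,
           sizeof(float) * SKETCH_BINS);
    if (++p->write_pos < SKETCH_FRAMES)
        return -1;

    int full = p->write_half;
    p->write_half ^= 1;  /* start filling the other half */
    p->write_pos = 0;
    return full;
}
```

The caller keeps pushing frames from the audio thread. Whenever `pingpong_push` returns 0 or 1, that half can be processed in the background while the other half fills, so the workers have one full block of time to finish.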

The program spawns 5 threads: 1 thread handles the FFT, T-F masking, IFFT, and overlap-add, while the 4 other threads run the deep learning tasks in the background.

We have 4 sources to demix, so we run 4 CNNs in parallel; within each CNN, the per-layer gemm() calls are sequential.

Can we get an even shorter intrinsic latency?

Yes, but it starts getting inefficient. The only way seems to be putting each new frame at the right-hand side of the spectrogram and sliding the context/history frames to the left, because keeping more context/history frames results in more reliable separation.
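The sliding scheme described above amounts to dropping the oldest frame and appending the newest one before every inference. A minimal sketch, assuming the spectrogram is stored frame by frame in one contiguous array (the function name and layout are hypothetical):

```c
#include <string.h>

/* spec holds `frames` consecutive frames of `bins` values each, oldest
   first. Shift everything one frame towards the start (dropping the
   oldest frame) and copy the newest frame into the last slot. */
void slide_context(float *spec, int frames, int bins, const float *newest)
{
    memmove(spec, spec + bins,
            (size_t)(frames - 1) * bins * sizeof(float));
    memcpy(spec + (size_t)(frames - 1) * bins, newest,
           (size_t)bins * sizeof(float));
}
```

Every call would be followed by a full CNN pass over all frames, even though only the last frame is new, which is where the inefficiency comes from.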

This way we process a lot of redundant frames that contribute almost nothing to the mask except the context information they provide.

Blazingly fast offline inference

How fast for the offline version?

Well, if you've linked the program with Intel MKL, I'm sure it will go as fast as TensorFlow-GPU inference; running SpleeterRT on a recent desktop CPU will offer you faster-than-TensorFlow-GPU inference.

How is that possible?

Multi-threading the inference of each frame definitely offers superior parallelism over the official Spleeter TensorFlow. I also parallelize the STFT computation in the offline version, so our performance outperforms the official Spleeter TensorFlow on almost every level.

Note that our offline weights are quantized (they are decompressed on execution) and can separate only [Accompaniment and Vocal] or [Drum, Accompaniment and Vocal], since I consider bass guitar separation not useful and a waste of computation.

What is the cost for this?

The cost is pretty obvious: I don't compute the ratio mask from the final spectrogram, and this is the only reason the quality is degraded compared to the official version. Skipping the ratio mask saves half of the computation in the 2-stem case.

The result will be identical to the official Spleeter if you add the ratio mask computation back.
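For readers who want to add that step back, a ratio mask can be sketched as each stem's estimate divided by the sum over all stems. The squared-magnitude form, the `eps` guard, the function name, and the flat stem-major layout below are assumptions for illustration, so check them against the official implementation before relying on them.

```c
/* est holds `stems` magnitude estimates of n bins each, stem-major:
   est[s * n + k] is stem s at bin k. For every bin, each stem's mask is
   its squared estimate divided by the sum of squared estimates over all
   stems, plus eps to avoid division by zero. */
void ratio_mask_sketch(const float *est, int stems, int n, float eps,
                       float *mask)
{
    for (int k = 0; k < n; k++) {
        float total = eps;
        for (int s = 0; s < stems; s++)
            total += est[s * n + k] * est[s * n + k];
        for (int s = 0; s < stems; s++)
            mask[s * n + k] = est[s * n + k] * est[s * n + k] / total;
    }
}
```

Each mask is then multiplied with the mixture spectrogram. Because every stem's estimate is needed to form the denominator, all networks have to run even if only one stem is wanted.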

Demo and screenshot

Mixture [0 - 15000 Hz]


Vocal [0 - 15000 Hz]


Accompaniment [0 - 15000 Hz]


Drum [0 - 15000 Hz]


Bass (guitar) [0 - 1200 Hz]


VST plugin in action

Adobe Audition

System Requirements and Installation

Currently, the UI is implemented using JUCE and exposes no adjustable parameters.

Any audio plugin host that is compatible with JUCE will run the program.

The Win32 API is used to find the user profile directory, from which the deep learning model is loaded with fread.

Other than that, the source separator should be able to run on Linux and macOS too.

I haven't investigated GEMM functions for ARM devices; if anyone finds a good implementation, please consider making a pull request to gemm.c.

Memory consumption is a big deal on mobile phones: the 4-stem model uses 700Mb of RAM and the 2-stem model uses around 300Mb. I have no plan to support 4 stems on mobile phones.


  1. Why not just go for libtensorflow for everything? TensorFlow is based on a static computation graph, so shouldn't it be suitable for audio inference?

A: There are a couple of reasons. The Python code related to the time-frequency transform and the CNN entry point must be rewritten before you can generate a useful frozen TensorFlow model (.pb), no matter how static the computation graph is. But why?

Simply because the official Spleeter TensorFlow model packs the 4 stems/CNNs into the same checkpoint file, and the 4 stems run entirely sequentially. The TensorFlow model runs fast because it uses SIMD instructions.

Other than that, TensorFlow doesn't parallelize the code path you want.

You would need to write a Python program that splits the checkpoint of the 4-stem model into 4 individual frozen graphs (.pb), and then use the libtensorflow API to run 4 individual sessions, one per thread.

  2. The audio processor is so slow, slower than the Python version on the same hardware.

A: Not really. The plugin isn't like the official Spleeter: we can't do everything offline. It makes no sense to write a real-time signal processor that runs in offline mode; online separation is what gives this repository its meaning.

The audio processor's buffering system adds extra overhead compared to the offline Python program.

At the same time, the FFT implementation certainly isn't the best, but it is definitely comparable to TensorFlow's.

Different audio plugin hosts and streaming systems have different buffer sizes; a system with a small buffer will definitely make the processing slower.


The main components of the project are GPL-licensed; beyond that, I don't know much about Intel MKL's licensing.


Deezer, of course: this repository wouldn't exist without their great model.

Intel MKL: without MKL, the convolution operation runs 40x slower.

  • Running on ARM micro controller

    Hi Thanks for your good work, and congratulations. I'm trying to figure out if it would be possible to run Vocals / Music separation in real time, with acceptable latency, on an ARM micro controller. From what I read in your description, we need 300 Mb of RAM to use the algorithm ? Does this includes the audio buffers to capture incoming audio data while processing a previous block of data ? Or are those 300 Mb only for Spleeter ? Also 300 Mbits or 300 Mbytes ? Thanks Jerome

    opened by jeromeDms 34
  • Compare SpleeterRT, Spleeter And Original Spectrum

    SpleeterRT is very different from the original spectrum, but Deezer Spleeter is very close to the original spectrum. Why? There are something wrong with SpleeterRT?

    opened by Jongan 2
  • crash moments after starting plugin

    Tested with audacity and VSTHost, in both cases the plugin does not work correctly, and crashes a short while after it is loaded. Unsure what is causing the issue, but the coefficients are present.

    I would also like to see if it's possible to run this with EqualizerAPO, and I know that this might be a seperate issue, as it runs the plugins as the audio service user. (Even ignoring that potential issue, it also crashes the same way there).

    opened by micsthepick 2
  • GEMM function for ARM device

    I haven't investigated the GEMM function for ARM device, if anyone finds good implementation, perhaps make a pull request to gemm.c.

    Maybe this link will be helpful
    opened by shalom-aviv 1
  • about batch normalization

    Great work! I read your code, but I am confused about the batch norm operation. I think batch norm at prediction time is (x-moving_mean)/moving_var * gamma + beta, while the code computes batchNorm[ _ + s] * val + batchNorm[ _ ]. I am not sure whether it is correct.

    opened by yansd-c 1
  • Any luck with this?

    Hi James,

    A buddy and I are embarking on working on something very similar to what it seems you worked on, getting Spleeter to run in real time on Raspberry Pi (ARM, low latency, real time, etc). Did you ever have any luck with this, or did you hit a wall? Any guidance you could provide would be much appreciated, and if it makes sense to create a consulting arrangement, we'd be interested in discussing that as well. Thanks in advance!


    opened by millimetre 1
  • about the win function applied in stft

    Hello again. I wrote down analysisWnd and synthesisWnd and plotted their curves. I cannot attach a picture here, but these two window functions look strange: analysisWnd is not symmetric and synthesisWnd has two peaks.

    I read the original spleeter code. It seems it uses hanning win as its stft win function.

    opened by yansd-c 1
  • v0.2-alpha(Jul 18, 2020)

    coefficients.7z contains 4 neural network model files, which the 2stems and 4stems VST DLLs read.

    For the 4stems VST: the quality of _256 is much higher than _128 and the separation is much more stable; however, the latency of _128 is half that of _256.

    For the 2stems VST: the quality is fairly similar for _128 and _256; _64 offers the minimum latency, and its quality is not bad at all. This version is highly optimized and closed source.

    Windows Installation:

    1. Extract all files inside coefficients.7z to C:\Users\YOUR PROFILE (this step is only required for 4stems)
    2. Extract Spleeter_Win.7z to your VST host's search path.
    3. Load the VST .dll in any VST host.

    Tested host: Adobe Audition, Audacity(x86), foobar2000 VST adapter.

    Caution: x86 4stems VSTs are likely to run out of memory and crash. That's why I will not support VST3 4stems, as VST3 usually costs more memory.

    Mac Installation:

    1. Install Intel MKL (important: I suspect the .vst was dynamically linked)
    2. Extract all files inside coefficients.7z to /Users/YOUR HOME DIRECTORY
    3. Extract the plugin to /Users/YOUR HOME DIRECTORY/Library/Audio/Plug-Ins/VST/
    4. Load the .vst in any VST host.

    Tested host: Adobe Audition(x64), Audacity(x64).

    Intrinsic latency of algorithm with STFT scheme:

    256 -> ((F / Lap) * T * BufFactor) / Fs -> ((4096 / 4) * 256 * 2) / 44100 -> 11.8886 secs
    128 -> ((F / Lap) * T * BufFactor) / Fs -> ((4096 / 4) * 128 * 2) / 44100 -> 5.9443 secs
    64 -> ((F / Lap) * T * BufFactor) / Fs -> ((4096 / 4) * 64 * 2) / 44100 -> 2.9721 secs


    2.9721 secs would be the lowest latency of any deep learning model-based monaural source separation algorithm I've ever seen! Since Spleeter is an STFT image segmentation network, we don't have to deal with the time-frame boundary effect of Demucs, or with the very high chance of nonlinear distortion in the output of Wave-U-Net and Demucs. Spleeter may have a boundary effect, but not in the time-domain sense of "adding" an impulse to your signal.

    The instructions above were designed for the online (VST) version. For the offline version (SpleeterRT_windows_offline.7z), you just need to extract the .exe file anywhere and use the CLI to play around.

    Source code(tar.gz)
    Source code(zip)
    coefficients.7z(135.79 MB)
    SpleeterRT_windows_offline.7z(34.53 MB)
    Spleeter_Win.7z(41.39 MB)
  • v0.1-alpha(Jul 7, 2020)

    The .7z package contains 2 DLL files, Spleeter4Stems_128.dll and Spleeter4Stems_256.dll, which are the same in functionality. The quality of _256 is much higher than _128 and the separation is much more stable; however, the latency of _128 is half that of _256.


    1. Extract .7z
    2. Copy accompaniment4stems.dat, bass4stems.dat, drum4stems.dat, vocal4stems.dat to C:\Users\YOUR PROFILE
    3. Load the VST .dll in any VST hosts.

    Tested host: Adobe Audition, Audacity, foobar2000 VST adapter.

    Source code(tar.gz)
    Source code(zip)
    VST_win32_x86.7z(137.12 MB)
James Fung
Signal processing, Blind source separation🔍, Real time Machine learning solution provider
International Business Machines 10 Dec 20, 2022