# Rolling Quantiles for NumPy

## Hyper-efficient and composable filters.

- Simple, clean, intuitive interface.
- Supports streaming data or bulk processing.
- Python 3 bindings for a lean library written in pure C.

### A Quick Tour

Let me give you but a superficial overview of this module's elegance.

```
import numpy as np
import rolling_quantiles as rq
pipe = rq.Pipeline( # rq.Pipeline is the only stateful object
# declare a cascade of filters by a sequence of immutable description objects
rq.LowPass(window=201, portion=100, subsample_rate=2),
# the above takes a median (101th element out of 201) of the most recent 200
# points and then spits out every other one
rq.HighPass(window=10, portion=3)
# that subsampled rolling median is then fed into this filter that takes a
# 30% quantile on a window of size 10, and subtracts it from its raw input
# the pipeline exposes a set of read-only attributes that describe it
pipe.lag # = 60.0, the effective number of time units that the real-time output
# is delayed from the input
pipe.stride # = 2, how many inputs it takes to produce an output
# (>1 due to subsampling)
input = np.random.randn(1000)
output = pipe.feed(input) # the core, singular exposed method
# every other output will be a NaN to demarcate unready values
subsampled_output = output[1::pipe.stride]
```

That may be a lot to take in, so let me break it down for you:

`rq.Pipeline(description...)`

constructs a filter pipeline from one or more filter descriptions and initializes internal state.`.feed(*)`

takes in a Python number or`np.array`

and its output is shaped likewise.- The two filter types are
`rq.LowPass`

and`rq.HighPass`

that compute rolling quantiles and return them as is, and subtract them from the raw signal respectively. Compose them however you like! `NaN`

s in the output purposefully indicate missing values, usually due to subsampling. If you pass a`NaN`

into a`LowPass`

filter, it will slowly deplete its reserve and continue to return valid quantiles until the window empties completely.`rq.LowPass`

and`rq.HighPass`

alternatively take in a`quantile=q`

argument,`0<=q<=1`

. The filters would perform a linear interpolation in this case. In order to control the statistical characteristics of this quantile estimate, parameters`alpha`

and`beta`

are exposed as well with default values`(1, 1)`

. Refer to SciPy's documentation for details on this aspect.

I also expose a convenience function `rq.medfilt(signal, window_size)`

at the top-level of the package to directly supplant `scipy.signal.medfilt`

.

That's it! I detailed the entire library. Don't let the size of its interface fool you!

## Installation

If you are running Linux, MacOS, or Windows with Python 3.8+ and NumPy ~1.20, execute the following:

`pip install rolling-quantiles`

These are the conditions under which binaries are built and sent to the Python Package Index, which holds `pip`

's packages. Should the NumPy version be unsuitable, for instance, I suggest building the package from source. This is rather straightforward because the handful of source files in C have absolutely minimal dependencies.

### Building from Source

The meat of this package is a handful of C files with no external dependencies, besides NumPy 1.16+ and Python 3.7+ for the bindings located in `src/python.c`

. As such, you may build from source by running the following from the project's root directory:

`cd python`

- Check
`pyproject.toml`

to make sure the listed NumPy version matches your desired target. `python -m build`

(make sure this invokes Python 3)`pip install dist/<name_of_generated_wheel_file>.whl`

#### Note of Caution on MacOS Big Sur

Make sure to specify `MACOSX_DEPLOYMENT_TARGET=10.X`

as a prefix to the build command, e.g. `python -m build`

. The placeholder `X`

can be any MacOS version earlier than Big Sur (I use `9`

.) By default, the build system would attempt to build for MacOS 11 that is incompatible with current Python interpreters that have been compiled against a prior version.

### Benchmarking a median filter on 100 million doubles.

I make use of binary heaps that impart desirable guarantees on their amortized runtime. Realistically, their performance may depend on the statistics of the incoming signal. I pummeled the filters with Gaussian Brownian motion to gauge their practical usability under a typical drifting stochastic process.

`window` |
`rolling_quantiles` [1] |
`scipy` [2] |
`pandas` [3] |
---|---|---|---|

4 | 14 seconds | 22 seconds | 25 seconds |

10 | 21 seconds | 47 seconds | 31 seconds |

20 | 28 seconds | 95 seconds | 35 seconds |

30 | 30 seconds | 140 seconds | 37 seconds |

40 | 34 seconds | 190 seconds | 40 seconds |

50 | 36 seconds | 242 seconds | 40 seconds |

1,000 | 61 seconds | N/A | 62 seconds |

Likewise, with simulated Gaussian white noise (no drift in the signal):

`window` |
`rolling_quantiles` [1] |
`scipy` [2] |
`pandas` [3] |
---|---|---|---|

4 | 14 seconds | 22 seconds | 25 seconds |

10 | 20 seconds | 51 seconds | 31 seconds |

20 | 25 seconds | 105 seconds | 36 seconds |

30 | 27 seconds | 156 seconds | 39 seconds |

40 | 30 seconds | 218 seconds | 41 seconds |

50 | 30 seconds | 279 seconds | 42 seconds |

1,000 | 45 seconds | N/A | 70 seconds |

Intel(R) Core(TM) i7-8700T CPU @ 2.40GHz, single-threaded performance on Linux. My algorithm looked even better (relative to pandas) on a 2020 MacBook Pro. Check out this StackOverflow answer for a particular use case.

[1] `rq.Pipeline(...)`

[2] `scipy.signal.medfilt(...)`

[3] `pd.Series.rolling(*).quantile(...)`