A C++17 library of computationally efficient methods for calculating sample statistics

Overview

Vectorized statistics using SIMD primitives


Introduction

This is a C++17 library of computationally efficient methods for calculating sample statistics (mean, variance, covariance, correlation).

  • The implementation builds upon the SIMD abstraction layer provided by the Vector Class Library [1].
  • It uses a data-parallel Youngs and Cramer [2] algorithm for numerically stable computation of sums and sums of squares.
  • Results from independent data partitions are combined with the approach by Schubert and Gertz [3].
  • The methods are validated for correctness against statistical methods from the GNU Scientific Library [4].
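The combination step can be illustrated with the well-known pairwise update for the sum of squared residuals. The following is a minimal scalar sketch, not the library's implementation: the names Partial, summarize and combine are hypothetical, and whether the library uses exactly this form of the formula is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type for this sketch (not part of the library's API):
// the summary of one data partition.
struct Partial {
    double n;    // number of points in the partition
    double sum;  // sum of the values
    double ssr;  // sum of squared residuals around the partition mean
};

// Two-pass summary of a single partition, used here only as a reference.
inline Partial summarize(std::vector<double> const& v) {
    Partial p{static_cast<double>(v.size()), 0.0, 0.0};
    if (v.empty()) { return p; }
    for (double x : v) { p.sum += x; }
    double const mean = p.sum / p.n;
    for (double x : v) { p.ssr += (x - mean) * (x - mean); }
    return p;
}

// Pairwise combination of two partition summaries:
//   ssr = ssr_a + ssr_b + (n_b * sum_a - n_a * sum_b)^2 / (n_a * n_b * (n_a + n_b))
// The correction term accounts for the distance between the partition means.
inline Partial combine(Partial const& a, Partial const& b) {
    if (a.n == 0.0) { return b; }
    if (b.n == 0.0) { return a; }
    double const d = b.n * a.sum - a.n * b.sum;
    return { a.n + b.n,
             a.sum + b.sum,
             a.ssr + b.ssr + d * d / (a.n * b.n * (a.n + b.n)) };
}
```

Combining the summaries of { 1, 2 } and { 3, 4 } gives n = 4, sum = 10 and ssr = 5, the same values a single pass over { 1, 2, 3, 4 } produces.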

Usage

To use this library, copy the contents of the include folder into your project and #include the library header. Defining VSTAT_NAMESPACE before inclusion sets a custom namespace for the library.

Two convenience methods are provided for batch data:

  • univariate::accumulate for univariate statistics (mean, variance, standard deviation)
  • bivariate::accumulate for bivariate statistics (covariance, correlation)

The methods return a statistics object which contains all the stat values. For example:

std::vector<float> values{ 1.0, 2.0, 3.0, 4.0 };
std::vector<float> weights{ 2.0, 4.0, 6.0, 8.0 };

// unweighted data
auto stats = univariate::accumulate<float>(values.begin(), values.end());
std::cout << "stats:\n" << stats << "\n";

count:                  4
sum:                    10
ssr:                    5
mean:                   2.5
variance:               1.25
sample variance:        1.66667

// weighted data
auto stats = univariate::accumulate<float>(values.begin(), values.end(), weights.begin());
std::cout << "stats:\n" << stats << "\n";

count:                  20
sum:                    60
ssr:                    20
mean:                   3
variance:               1
sample variance:        1.05263
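The printed fields can be reproduced with a plain two-pass computation. The sketch below is a reference for checking the numbers, not the library's code: the struct and function names are hypothetical, and the field definitions (count as the sum of weights, sample variance as ssr / (count - 1)) are inferred from the output shown above.

```cpp
#include <cstddef>
#include <vector>

// Field names mirror the printed output; the struct itself is a
// hypothetical stand-in for this sketch, not the library's type.
struct WeightedSummary {
    double count;            // sum of the weights
    double sum;              // sum of w_i * x_i
    double ssr;              // sum of w_i * (x_i - mean)^2
    double mean;             // sum / count
    double variance;         // ssr / count
    double sample_variance;  // ssr / (count - 1)
};

// Plain two-pass reference computation (no SIMD, no streaming).
// Assumes x and w have the same length and the weights sum to more than 1.
inline WeightedSummary weighted_summary(std::vector<double> const& x,
                                        std::vector<double> const& w) {
    WeightedSummary s{};
    for (std::size_t i = 0; i < x.size(); ++i) {
        s.count += w[i];
        s.sum += w[i] * x[i];
    }
    s.mean = s.sum / s.count;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double const d = x[i] - s.mean;
        s.ssr += w[i] * d * d;
    }
    s.variance = s.ssr / s.count;
    s.sample_variance = s.ssr / (s.count - 1.0);
    return s;
}
```

With unit weights this yields the unweighted results (count 4, ssr 5, variance 1.25). With the weights { 2, 4, 6, 8 } it yields count 20, sum 60, ssr 20, mean 3, variance 1 and sample variance 20 / 19.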

Besides iterators, it is also possible to provide raw pointers:

float x[] = { 1., 1., 2., 6. };
float y[] = { 2., 4., 3., 1. };
size_t n = std::size(x);

auto stats = bivariate::accumulate<float>(x, y, n);
std::cout << "stats:\n" << stats << "\n";

// results
count:                  4
sum_x:                  10
ssr_x:                  17
mean_x:                 2.5
variance_x:             4.25
sample variance_x:      5.66667
sum_y:                  10
ssr_y:                  5
mean_y:                 2.5
variance_y:             1.25
sample variance_y:      1.66667
correlation:            -0.759257
covariance:             -1.75
sample covariance:      -2.33333
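For reference, the covariance and correlation values above follow from the usual definitions. The sketch below computes them with a naive two-pass loop; the names are hypothetical and the code is only meant for checking results, not as a description of how the library computes them.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical result type for this sketch.
struct BivariateSummary {
    double covariance;         // sum((x - mean_x) * (y - mean_y)) / n
    double sample_covariance;  // same sum divided by (n - 1)
    double correlation;        // same sum divided by sqrt(ssr_x * ssr_y)
};

// Naive two-pass reference. Assumes n >= 2 and non-constant x and y,
// otherwise the divisions below are not meaningful.
inline BivariateSummary bivariate_summary(float const* x, float const* y, std::size_t n) {
    double sum_x = 0.0, sum_y = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum_x += x[i];
        sum_y += y[i];
    }
    double const mean_x = sum_x / static_cast<double>(n);
    double const mean_y = sum_y / static_cast<double>(n);

    double ssr_x = 0.0, ssr_y = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double const dx = x[i] - mean_x;
        double const dy = y[i] - mean_y;
        ssr_x += dx * dx;
        ssr_y += dy * dy;
        sxy += dx * dy;
    }
    return { sxy / static_cast<double>(n),
             sxy / static_cast<double>(n - 1),
             sxy / std::sqrt(ssr_x * ssr_y) };
}
```

For x = { 1, 1, 2, 6 } and y = { 2, 4, 3, 1 } the sum of cross products is -7, which gives a covariance of -1.75, a sample covariance of -2.33333 and a correlation of -7 / sqrt(17 * 5), about -0.759257.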

It is also possible to use projections to aggregate stats over object properties:

struct Foo {
    float value;
};

Foo foos[] = { {1}, {3}, {5}, {2}, {8} };
auto stats = univariate::accumulate<float>(foos, std::size(foos), [](auto const& foo) { return foo.value; });
std::cout << "stats:\n" << stats << "\n";

// results
count:                  5
sum:                    19
ssr:                    30.8
mean:                   3.8
variance:               6.16
sample variance:        7.7

struct Foo {
    float value;
};

struct Bar {
    int value;
};

Foo foos[] = { {1}, {3}, {5}, {2}, {8} };
Bar bars[] = { {3}, {2}, {1}, {4}, {11} };

auto stats = bivariate::accumulate<float>(foos, bars, std::size(foos), [](auto const& foo) { return foo.value; },
                                                                       [](auto const& bar) { return bar.value; });
std::cout << "stats:\n" << stats << "\n";

// results
count:                  5
sum_x:                  19
ssr_x:                  30.8
mean_x:                 3.8
variance_x:             6.16
sample variance_x:      7.7
sum_y:                  21
ssr_y:                  62.8
mean_y:                 4.2
variance_y:             12.56
sample variance_y:      15.7
correlation:            0.686676
covariance:             6.04
sample covariance:      7.55
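The idea behind projections can be sketched with a small generic function that applies a callable to each object before the value enters the computation. This is an illustration under stated assumptions: Item, MeanVar and mean_variance are hypothetical names, and the library's own overloads may be implemented differently.

```cpp
#include <cstddef>

// Sample object type for this sketch (hypothetical).
struct Item {
    float value;
};

// Hypothetical result type: just the two quantities needed for the example.
struct MeanVar {
    double mean;
    double variance;  // population variance: ssr / n
};

// `proj` maps each object to the number that should enter the statistics,
// so the caller never has to copy the values into a separate array.
// Assumes n > 0.
template <typename T, typename Proj>
MeanVar mean_variance(T const* items, std::size_t n, Proj proj) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += static_cast<double>(proj(items[i]));
    }
    double const mean = sum / static_cast<double>(n);

    double ssr = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double const d = static_cast<double>(proj(items[i])) - mean;
        ssr += d * d;
    }
    return { mean, ssr / static_cast<double>(n) };
}
```

Called with the five objects { 1, 3, 5, 2, 8 } and the projection [](auto const& it) { return it.value; }, it returns a mean of 3.8 and a variance of 6.16, matching the univariate output above.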

The methods above accept a batch of data and calculate relevant statistics. If the data is streaming, then one can also use accumulators. The accumulator is a lower-level object that is able to perform calculations online as new data arrives:

univariate_accumulator<float> acc(1.0); // it's important to initialize the accumulator!
acc(2.0); // then we can stream values to it
acc(3.0);
acc(4.0);
auto stats = univariate_statistics(acc);
std::cout << "stats:\n" << stats << "\n";

Count:                  4
Sum:                    10
Sum of squares:         5
Mean:                   2.5
Variance:               1.25
Sample variance:        1.66667
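How an online update of this kind can work is shown in the scalar sketch below, which uses a Youngs and Cramer style update extended with weights. StreamingAccumulator is a hypothetical class written for this illustration; the library's univariate_accumulator may differ in interface and internals (for instance, this sketch starts empty instead of requiring an initial value).

```cpp
#include <cstddef>

// Hypothetical scalar accumulator for this sketch.
class StreamingAccumulator {
public:
    // Feed one value with an optional positive weight (default 1).
    void operator()(double x, double w = 1.0) {
        double const total = weight_ + w;
        sum_ += w * x;
        if (weight_ > 0.0) {
            // With the updated sum, total * x - sum_ equals
            // weight_ * (x - previous mean), so the increment below is
            // w * weight_ / total * (x - previous mean)^2.
            double const d = total * x - sum_;
            ssr_ += w * d * d / (total * weight_);
        }
        weight_ = total;
    }

    double count() const { return weight_; }
    double sum() const { return sum_; }
    double ssr() const { return ssr_; }
    double mean() const { return sum_ / weight_; }
    double variance() const { return ssr_ / weight_; }
    double sample_variance() const { return ssr_ / (weight_ - 1.0); }

private:
    double weight_ = 0.0;  // number of values seen, or sum of weights
    double sum_ = 0.0;     // running (weighted) sum
    double ssr_ = 0.0;     // running sum of squared residuals
};
```

Streaming 1, 2, 3, 4 into this accumulator adds 0.5, 1.5 and 3 to the sum of squared residuals in turn, ending at 5 with a mean of 2.5, the same figures as in the output above. No pass over previously seen data is needed.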

The template parameter tells the accumulator how to represent data internally.

  • if a scalar type is provided (e.g. float or double), the accumulator will perform all operations with scalars (i.e., no SIMD).
  • if a SIMD type is provided (e.g., Vec8f or Vec4d), the accumulator will perform data-parallel operations.

This allows the user to combine accumulators, for example using a SIMD-enabled accumulator to process the bulk of the data and a scalar accumulator for the left-over points.
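The bulk-plus-remainder pattern can be emulated without any SIMD types by keeping one scalar accumulator per lane. The sketch below is only an illustration of the idea: Acc, kLanes and accumulate_lanes are hypothetical names, the lane count of 4 is an assumption, and real SIMD code would update all lanes with a single vector instruction rather than an inner loop.

```cpp
#include <array>
#include <cstddef>

// Hypothetical per-lane state: count, sum and sum of squared residuals.
struct Acc {
    double n = 0.0;
    double sum = 0.0;
    double ssr = 0.0;

    // Online update with a single value.
    void add(double x) {
        double const total = n + 1.0;
        sum += x;
        if (n > 0.0) {
            double const d = total * x - sum;
            ssr += d * d / (total * n);
        }
        n = total;
    }

    // Pairwise combination with another accumulator.
    void merge(Acc const& o) {
        if (o.n == 0.0) { return; }
        if (n == 0.0) { *this = o; return; }
        double const d = o.n * sum - n * o.sum;
        ssr += o.ssr + d * d / (n * o.n * (n + o.n));
        sum += o.sum;
        n += o.n;
    }
};

constexpr std::size_t kLanes = 4;  // assumed lane count for this sketch

inline Acc accumulate_lanes(double const* x, std::size_t count) {
    std::array<Acc, kLanes> lanes{};
    std::size_t const bulk = count - count % kLanes;

    // Bulk: element i + l always goes to lane l, as a vector load would do.
    for (std::size_t i = 0; i < bulk; i += kLanes) {
        for (std::size_t l = 0; l < kLanes; ++l) {
            lanes[l].add(x[i + l]);
        }
    }

    // Left-over points that do not fill a whole group of lanes.
    Acc total;
    for (std::size_t i = bulk; i < count; ++i) {
        total.add(x[i]);
    }

    // Horizontal reduction: fold every lane into the final result.
    for (Acc const& lane : lanes) {
        total.merge(lane);
    }
    return total;
}
```

For the ten values 1 through 10, the four lanes each see two values, the scalar part sees 9 and 10, and the merged result is n = 10, sum = 55 and ssr = 82.5.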

Available statistics

  • univariate

    struct univariate_statistics {
        double count;
        double sum;
        double ssr;
        double mean;
        double variance;
        double sample_variance;
    };

  • bivariate

    struct bivariate_statistics {
        double count;
        double sum_x;
        double sum_y;
        double ssr_x;
        double ssr_y;
        double sum_xy;
        double mean_x;
        double mean_y;
        double variance_x;
        double variance_y;
        double sample_variance_x;
        double sample_variance_y;
        double correlation;
        double covariance;
        double sample_covariance;
    };

Benchmarks

Performance in the univariate (variance) and bivariate (covariance) case was compared against several other libraries, among them linasm, gsl and numpy (see the notes below).

Methodology

  • we generate 1M values uniformly distributed in [-1, 1] and store them in a double array and a float array
  • we increase the data size in 100k increments and benchmark each method at every size using nanobench
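The procedure can be outlined as follows. This is a simplified sketch and not the benchmark code used for the results: it times a single run per size with std::chrono instead of nanobench, uses a naive two-pass variance as a stand-in for the method under test, and all function names and the seed are assumptions. Note that std::uniform_real_distribution draws from the half-open interval [-1, 1).

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Draw `n` values uniformly from [-1, 1) with a fixed seed for repeatability.
inline std::vector<double> make_data(std::size_t n, unsigned seed = 1234) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<double> v(n);
    for (double& x : v) { x = dist(rng); }
    return v;
}

// Stand-in for the method under test: naive two-pass population variance.
// Assumes n > 0.
inline double variance_two_pass(double const* x, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) { sum += x[i]; }
    double const mean = sum / static_cast<double>(n);
    double ssr = 0.0;
    for (std::size_t i = 0; i < n; ++i) { ssr += (x[i] - mean) * (x[i] - mean); }
    return ssr / static_cast<double>(n);
}

// Time the method on growing prefixes of the data: step, 2 * step, ...
// Returns (prefix length, elapsed nanoseconds) pairs.
inline std::vector<std::pair<std::size_t, long long>>
run_benchmark(std::vector<double> const& data, std::size_t step) {
    std::vector<std::pair<std::size_t, long long>> out;
    if (step == 0) { return out; }
    volatile double sink = 0.0;  // keeps the compiler from discarding the work
    for (std::size_t n = step; n <= data.size(); n += step) {
        auto const t0 = std::chrono::steady_clock::now();
        sink = variance_two_pass(data.data(), n);
        auto const t1 = std::chrono::steady_clock::now();
        out.emplace_back(
            n, std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
    (void)sink;
    return out;
}
```

Calling run_benchmark(make_data(1000000), 100000) mirrors the sizes described above and returns ten measurements. A dedicated benchmarking framework repeats each measurement many times and reports robust statistics, which a single timed run cannot provide.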

Notes

  • we did not use MKL as a backend for numpy and gsl (expect MKL performance to be higher)
  • linasm methods for variance and covariance require precomputed array means, so means computation is factored into the benchmarks
  • hardware: Ryzen 9 5950X

Acknowledgements

[1] Vector Class Library

[2] Youngs and Cramer - Some Results Relevant to Choice of Sum and Sum-of-Product Algorithms

[3] Schubert and Gertz - Numerically stable parallel computation of (co-)variance

[4] GNU Scientific Library

Owner
HEAL
Software releases of the research group Heuristic and Evolutionary Algorithms Laboratory (HEAL), situated at the FH OÖ Hagenberg Campus.