An MPI program for distributed histogram computation on large data sets, with performance analysis on multi-core systems

Overview

mpi-histo

An MPI program for distributed histogram computation on large data sets, with performance analysis on multi-core systems. The program is written in C and uses the SPMD (single program, multiple data) model: every process runs the same executable and cooperates to compute the histogram. There are two implementations:

histo.c     => uses point-to-point communication (MPI_Send, MPI_Recv) between processes
mpi_histo.c => uses collective communication (MPI_Bcast, MPI_Scatter, and MPI_Gather) between processes; a minimal sketch of this variant follows
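
The collective variant works roughly as follows: process 0 generates the data, the histogram parameters are broadcast, the data is scattered in equal chunks, each process bins its chunk locally, and the local histograms are gathered at process 0 and summed. The sketch below illustrates that flow; names and parameter handling are illustrative and do not reproduce mpi_histo.c exactly, and data_count is assumed divisible by the number of processes.

/*
 * Minimal sketch of the collective-communication histogram.
 * Illustrative only, not the exact code of mpi_histo.c.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Illustrative parameters; the real program takes these from argv. */
    int bin_count = 10, data_count = 1000000;
    double data_min = 1.0, data_max = 100.0;

    double *data = NULL;
    if (rank == 0) {                       /* process 0 creates the data */
        data = malloc(data_count * sizeof(double));
        for (int i = 0; i < data_count; i++)
            data[i] = data_min + (data_max - data_min) * ((double)rand() / RAND_MAX);
    }

    /* Broadcast the histogram parameters so every process agrees on them. */
    MPI_Bcast(&bin_count, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Scatter equal chunks of the data to all processes. */
    int local_count = data_count / size;
    double *local_data = malloc(local_count * sizeof(double));
    MPI_Scatter(data, local_count, MPI_DOUBLE,
                local_data, local_count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process bins only its own chunk. */
    int *local_hist = calloc(bin_count, sizeof(int));
    double bin_width = (data_max - data_min) / bin_count;
    for (int i = 0; i < local_count; i++) {
        int b = (int)((local_data[i] - data_min) / bin_width);
        if (b >= bin_count) b = bin_count - 1;   /* clamp value == data_max */
        local_hist[b]++;
    }

    /* Gather every local histogram at process 0 and sum them there. */
    int *all_hists = NULL;
    if (rank == 0) all_hists = malloc((size_t)size * bin_count * sizeof(int));
    MPI_Gather(local_hist, bin_count, MPI_INT,
               all_hists, bin_count, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int b = 0; b < bin_count; b++) {
            int total = 0;
            for (int p = 0; p < size; p++)
                total += all_hists[p * bin_count + b];
            printf("bin %d: %d\n", b, total);
        }
        free(all_hists);
        free(data);
    }
    free(local_data);
    free(local_hist);
    MPI_Finalize();
    return 0;
}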

Compilation and Execution

The programs are compiled using mpicc and run using mpiexec. The compilation and execution instructions are shown below:

Compilation: mpicc mpi_histo.c -o mpi_histo

Execution: mpiexec -n <no_processes> ./mpi_histo <bin_count> <data_min> <data_max> <data_count>

The command-line arguments are:

no_processes  : Total number of processes to start (passed to mpiexec -n).
bin_count     : Number of bins in the histogram.
data_min      : Minimum value of the data.
data_max      : Maximum value of the data.
data_count    : Number of random data points to generate.

By default the program generates data_count random data points between data_min and data_max and computes their histogram. The code can be modified to fit other scenarios, for example to read a large data file and compute the histogram on real data, as sketched below.
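
One hypothetical way to do that: have process 0 read data_count doubles from a raw binary file instead of generating them. The file layout and the helper name below are assumptions for illustration, not part of the program:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical: read data_count doubles from a raw binary file into a
 * freshly allocated buffer; the caller (process 0) frees the result. */
double *read_data(const char *path, int data_count) {
    double *data = malloc((size_t)data_count * sizeof(double));
    FILE *fp = fopen(path, "rb");
    if (!fp || fread(data, sizeof(double), data_count, fp) != (size_t)data_count) {
        fprintf(stderr, "failed to read %d doubles from %s\n", data_count, path);
        exit(EXIT_FAILURE);
    }
    fclose(fp);
    return data;
}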

Example:
Compilation: mpicc mpi_histo.c -o mpi_histo
Execution: mpiexec -n 2 ./mpi_histo 10 1 100 10
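
This launches 2 processes and computes a 10-bin histogram over 10 random values drawn between 1 and 100.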

Performance Analysis

The histogram computation was done for 50,000,000, 100,000,000, and 200,000,000 data points using 1 to 64 cores on a 4-node CHPC kingspeak cluster. The results obtained are described here.

1. Timing Performance


Total execution time: the time taken by the program to complete its execution.
Application time: the time taken to create and distribute the data and compute the histogram.
Histogram time: the maximum time taken by any single process to compute the histogram on its portion of the scattered data.
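
The histogram time as defined above suggests a MAX reduction of per-process elapsed times. A minimal sketch of how this could be measured with MPI_Wtime; the actual instrumentation in histo.c / mpi_histo.c may differ:

#include <mpi.h>

/* Sketch: each rank measures its own elapsed time, and rank 0 receives
 * the maximum across all ranks (the return value is meaningful on
 * rank 0 only). */
double max_elapsed_since(double local_start) {
    double local_elapsed = MPI_Wtime() - local_start;
    double max_time = 0.0;
    MPI_Reduce(&local_elapsed, &max_time, 1, MPI_DOUBLE,
               MPI_MAX, 0, MPI_COMM_WORLD);
    return max_time;
}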

From the figures, for the serial run (1 process), the time to compute the histogram grows as the number of data elements increases. As the number of processes increases, the total time to compute the histogram decreases, because the work is divided among the processes. We can also see that the total time and the application time are greater than the time taken to compute the histogram. The reason is that the total execution time and the application time include the time taken to scatter the data among the processes, whereas the histogram time only accounts for the maximum time taken by a process to compute the histogram.

2. Speedup Analysis


The speedup of a parallel program is given as S(n, p) = T_serial(n) / T_parallel(n, p). From the figures we can see that the speedup of the total program time is lower than the speedup of the application time and of the histogram computation time. As with the timing results, the total execution time includes the time taken by process 0 to populate and scatter the data. For the histogram time alone, however, the speedup is much higher and is nearly linear.
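
For example, from the table below, at 200000000 data points the histogram-time speedup on 64 processes is 294.842392 / 4.63593 ≈ 63.6, close to the ideal 64, while the total-time speedup is only 298.528679 / 12.687857 ≈ 23.5.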

Note: As the number of processes increases, communication overhead also grows, which limits how much further the execution time can be reduced.

Timing data obtained from my runs on the 4-node CHPC kingspeak cluster (all times in seconds):

Processes   Total Time    Application Time   Histogram Time   Data Count
1           74.423788     73.756131          73.569383        50000000
4           19.308111     18.636843          18.439549        50000000
16          5.504118      4.754371           4.615857         50000000
32          3.926105      3.230799           2.302202         50000000
64          3.436093      2.481364           1.150515         50000000
1           149.605372    148.257896         147.896947       100000000
4           38.644788     37.317932          36.934477        100000000
16          10.867961     9.537738           9.233756         100000000
32          7.780161      6.458228           4.580765         100000000
64          6.301298      4.971914           2.294652         100000000
1           298.528679    295.804794         294.842392       200000000
4           77.303529     74.629462          73.983486        200000000
16          21.717287     19.049401          18.434128        200000000
32          15.902424     13.048625          9.19137          200000000
64          12.687857     9.89958            4.63593          200000000
