monolish: MONOlithic LInear equation Solvers for Highly-parallel architecture

Overview

monolish is a linear equation solver library that monolithically fuses variable data types, matrix structures, matrix data formats, vendor-specific data transfer APIs, and vendor-specific numerical algebra libraries.


monolish lets developers forget about:

  • Performance tuning
  • Differences between the processors that run the library (Intel CPU, NVIDIA GPU, AMD CPU, ARM CPU, NEC SX-Aurora TSUBASA, etc.)
  • Vendor-specific data transfer APIs (host RAM to device RAM)
  • Finding bottlenecks and performance benchmarks
  • The argument data types of matrix/vector operations
  • Matrix structures and storage formats
  • Cumbersome package dependencies

License

Copyright 2021 RICOS Co. Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • It seems that the `Dense(M, N, min, max)` constructor is not completely random.

    Running the following simple program

    #include <iostream>
    #include <monolish_blas.hpp>
    
    int main() {
      monolish::matrix::Dense<double> x(2, 3, 0.0, 10.0);
      x.print_all();
      return 0;
    }
    
    $ g++ -O3 main.cpp -o main.out -lmonolish_cpu
    

    will produce results like this.

    $ ./main.out
    1 1 5.27196
    1 2 2.82358 <--
    1 3 2.13893 <--
    2 1 9.72054
    2 2 2.82358 <--
    2 3 2.13893 <--
    $ ./main.out
    1 1 5.3061
    1 2 9.75236
    1 3 7.15652
    2 1 5.28961
    2 2 2.05967
    2 3 0.59838
    $ ./main.out
    1 1 9.33149 <--
    1 2 4.75639 <--
    1 3 8.71093 <--
    2 1 9.33149 <--
    2 2 4.75639 <--
    2 3 8.71093 <--
    

    The arrows (<--) mark values that repeat.

    This is probably because the pseudo-random number generator is not split properly across threads when it is parallelized with OpenMP.

    https://github.com/ricosjp/monolish/blob/1b89942e869b7d0acd2d82b4c47baeba2fbdf3e6/src/utils/dense_constructor.cpp#L120-L127

    This may happen not only with Dense, but also with random constructors of other data structures.

    I tested this on docker image ghcr.io/ricosjp/monolish/mkl:0.14.1.
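
One common way to avoid this kind of repetition is to give every chunk of the output its own generator, seeded from a shared base seed plus the chunk index. The sketch below assumes the cause suggested above (identically seeded or shared generators across threads); `fill_random`, its parameters, and the chunk size are hypothetical names for illustration and are not monolish's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not part of monolish): fill n values drawn uniformly
// between min and max, using one independent generator per chunk.
std::vector<double> fill_random(std::int64_t n, double min, double max,
                                std::uint32_t base_seed,
                                std::int64_t chunk = 1024) {
  std::vector<double> v(static_cast<std::size_t>(n));
  const std::int64_t nchunks = (n + chunk - 1) / chunk;

  // The pragma is simply ignored when OpenMP is disabled. The values do not
  // depend on the thread count because each chunk's stream is determined
  // only by (base_seed, chunk index).
#pragma omp parallel for
  for (std::int64_t c = 0; c < nchunks; ++c) {
    std::seed_seq seq{base_seed, static_cast<std::uint32_t>(c)};
    std::mt19937 gen(seq);
    std::uniform_real_distribution<double> dist(min, max);
    const std::int64_t end = std::min(n, (c + 1) * chunk);
    for (std::int64_t i = c * chunk; i < end; ++i) {
      v[static_cast<std::size_t>(i)] = dist(gen);
    }
  }
  return v;
}
```

Seeding by chunk index rather than by thread number keeps the output reproducible for a given `base_seed`, at the cost of constructing one generator per chunk.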

    opened by lotz84 5
  • impl. transpose matvec, matmul

    I want to pass transposition information in a modern and intuitive way, but I have no idea how to implement it easily.

    First, we create the following functions as a prototype:

    matmul(A, B, C);     // C = AB
    matmul_TNN(A, B, C); // C = A^T B
    matvec(A, x, y);     // y = Ax
    matvec_T(A, x, y);   // y = A^T x
    

    This interface is not beautiful. However, it has the following advantages:

    • It does not affect other functions.
    • Easy to trace with logger
    • Simple to implement FFI in the future.
    • When beautiful ideas appear in the future, these functions can be implemented wrapping it.
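
To make the semantics of the four prototype functions concrete, here is a naive reference sketch on a hypothetical row-major matrix type. `Mat` is an illustrative stand-in, not `monolish::matrix::Dense`, and the loops perform no dimension checks; only the function names and the formulas in the comments come from the proposal above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix used only for this illustration.
struct Mat {
  std::size_t rows, cols;
  std::vector<double> val;
  Mat(std::size_t r, std::size_t c) : rows(r), cols(c), val(r * c, 0.0) {}
  double &at(std::size_t i, std::size_t j) { return val[i * cols + j]; }
  double at(std::size_t i, std::size_t j) const { return val[i * cols + j]; }
};

// C = A B
void matmul(const Mat &A, const Mat &B, Mat &C) {
  for (std::size_t i = 0; i < A.rows; ++i)
    for (std::size_t j = 0; j < B.cols; ++j) {
      double s = 0.0;
      for (std::size_t k = 0; k < A.cols; ++k) s += A.at(i, k) * B.at(k, j);
      C.at(i, j) = s;
    }
}

// C = A^T B: A is read with swapped indices, so no transposed copy is made.
void matmul_TNN(const Mat &A, const Mat &B, Mat &C) {
  for (std::size_t i = 0; i < A.cols; ++i)
    for (std::size_t j = 0; j < B.cols; ++j) {
      double s = 0.0;
      for (std::size_t k = 0; k < A.rows; ++k) s += A.at(k, i) * B.at(k, j);
      C.at(i, j) = s;
    }
}

// y = A x
void matvec(const Mat &A, const std::vector<double> &x,
            std::vector<double> &y) {
  for (std::size_t i = 0; i < A.rows; ++i) {
    double s = 0.0;
    for (std::size_t j = 0; j < A.cols; ++j) s += A.at(i, j) * x[j];
    y[i] = s;
  }
}

// y = A^T x
void matvec_T(const Mat &A, const std::vector<double> &x,
              std::vector<double> &y) {
  for (std::size_t j = 0; j < A.cols; ++j) {
    double s = 0.0;
    for (std::size_t i = 0; i < A.rows; ++i) s += A.at(i, j) * x[i];
    y[j] = s;
  }
}
```

For A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], `matmul` gives [[19, 22], [43, 50]] and `matmul_TNN` gives [[26, 30], [38, 44]].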
    opened by t-hishinuma 2
  • try -fopenmp-cuda-mode flag

    memo:

    Clang supports two data-sharing models for Cuda devices: Generic and Cuda modes. The default mode is Generic. Cuda mode can give an additional performance and can be activated using the -fopenmp-cuda-mode flag. In Generic mode all local variables that can be shared in the parallel regions are stored in the global memory. In Cuda mode local variables are not shared between the threads and it is user responsibility to share the required data between the threads in the parallel regions.

    https://clang.llvm.org/docs/OpenMPSupport.html#basic-support-for-cuda-devices

    opened by t-hishinuma 2
  • Research the effect of level information on the performance of the cuSPARSE ILU preconditioner

    The level information may not improve the performance but spend extra time doing analysis. For example, a tridiagonal matrix has no parallelism. In this case, CUSPARSE_SOLVE_POLICY_NO_LEVEL performs better than CUSPARSE_SOLVE_POLICY_USE_LEVEL. If the user has an iterative solver, the best approach is to do csrsv2_analysis() with CUSPARSE_SOLVE_POLICY_USE_LEVEL once. Then do csrsv2_solve() with CUSPARSE_SOLVE_POLICY_NO_LEVEL in the first run and with CUSPARSE_SOLVE_POLICY_USE_LEVEL in the second run, picking faster one to perform the remaining iterations.

    https://docs.nvidia.com/cuda/cusparse/index.html#csric02

    opened by t-hishinuma 2
  • ignoring return value in test

    matrix_transpose.cpp:60:3: warning: ignoring return value of 'monolish::matrix::COO<Float>& monolish::matrix::COO<Float>::transpose() [with Float = double]', declared with attribute nodiscard [-Wunused-result]
       60 |   A.transpose();
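
Two usual ways to silence this warning are to actually use the returned reference, or to cast the call to `void` to document that the result is discarded on purpose. The sketch below uses a hypothetical `TinyCOO` type with a `[[nodiscard]]` `transpose()`; it is a stand-in for illustration, not `monolish::matrix::COO`.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in whose transpose() is marked [[nodiscard]].
struct TinyCOO {
  std::vector<std::size_t> row, col;
  [[nodiscard]] TinyCOO &transpose() {
    std::swap(row, col); // transposing COO indices swaps row and column
    return *this;
  }
};

// Option 1: bind and use the returned reference, so nothing is discarded.
std::size_t first_row_after_transpose(TinyCOO &A) {
  TinyCOO &At = A.transpose();
  return At.row.at(0);
}

// Option 2: an explicit cast to void marks the discard as intentional.
void transpose_only(TinyCOO &A) { static_cast<void>(A.transpose()); }
```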
    
    opened by t-hishinuma 2
  • Automatic deploy at release

    Implement in GitHub Actions:

    • [x] generate Doxygen (need to change version name)
    • [x] generate deb file
    • [x] generate monolish docker

    need to get version number...

    opened by t-hishinuma 2
  • write how to install nvidia-docker

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt update -y
    sudo apt install -y nvidia-docker2
    sudo systemctl restart docker
    
    opened by t-hishinuma 1
  • Resolve conflict of libmonolish-cpu and libmonolish-nvidia-gpu deb package

    What conflicts?

    libomp.so is contained in both packages.

    How to resolve?

    • (a) Use libomp5-12 distributed by Ubuntu
    • (b) Create another package of libomp in allgebra stage (libomp-allgebra)
    opened by termoshtt 1
  • resolve curse of type name in src/

    In src/, int and size_t are written out explicitly. When the class or function declarations in include/ change, I don't want to have to rewrite src/. Use auto or decltype() to remove them.
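
A minimal sketch of the idea, assuming a hypothetical `Vec` class in place of a real monolish declaration: the definition derives its index and accumulator types from the declaration, so changing the header does not require touching the loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical declaration that would live in include/.
struct Vec {
  std::vector<double> val;
  std::size_t size() const { return val.size(); }
};

// Hypothetical definition that would live in src/: neither int, size_t,
// nor double is spelled out here.
auto sum(const Vec &v) {
  // The accumulator takes the element type of the container.
  auto s = decltype(v.val)::value_type{0};
  // The index takes whatever type size() is declared to return.
  for (decltype(v.size()) i = 0; i < v.size(); ++i) {
    s += v.val[i];
  }
  return s;
}
```

If `size()` were later changed to return a different integer type, or `val` to hold `float`, this definition would compile unchanged.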

    opened by t-hishinuma 1
  • LLVM OpenMP Offloading can be installed by apt?

    docker run -it --gpus all -v $PWD:/work nvidia/cuda:11.7.0-devel-ubuntu22.04
    ==
    apt update -y
    apt install -y git intel-mkl cmake ninja-build ccache clang clang-tools libomp-14-dev gcc gfortran
    git config --global --add safe.directory /work
    cd /work; make gpu
    

    pass??

    opened by t-hishinuma 0
  • cuSPARSE IC / ILU functions are deprecated

    However, the cuSPARSE sample code has not been updated.

    https://docs.nvidia.com/cuda/cusparse/index.html#csric02

    I don't like trial and error, so I will wait for the sample code to be updated.

    opened by t-hishinuma 0
Releases(0.17.0)
Owner
RICOS Co. Ltd.
株式会社科学計算総合研究所 / Research Institute for Computational Science Co. Ltd.