Edge ML Library - High-performance Compute Library for On-device Machine Learning Inference

Overview

中文介绍 (Chinese introduction)

Edge ML Library (EMLL) offers optimized basic routines such as general matrix multiplication (GEMM) and quantization to speed up machine learning (ML) inference on ARM-based devices. EMLL supports the fp32, fp16 and int8 data types. EMLL accelerates the on-device NMT, ASR and OCR engines of Youdao, Inc.

Features

Performance-Oriented Design

The matrix-multiplication routines are heavily optimized for matrix shapes common in on-device ML tasks, including "skinny" ones. The matrix-multiplication kernels are tuned for specific CPUs and written largely in inline assembly.

Here are benchmarks of SGEMM on two machines[1]:

[Benchmark charts: SGEMM performance, 4 threads, on ARMv8-A Cortex-A35 (test1) and ARMv8-A Cortex-A53 (test2).]

[1] The formula of GEMM: C[MxN] = A[MxK] B[KxN]. For each test case, the better result of the all-row-major and the all-column-major configuration is reported.
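
For reference, the plain C loop below (an illustration only, not EMLL code) spells out this formula for row-major, densely stored matrices, assuming the beta parameter mentioned in the API table below pre-scales the existing contents of C:

```c
/* Reference semantics of C[MxN] = A[MxK] * B[KxN] with a pre-multiplying
 * factor beta applied to C. All matrices are row-major and densely stored. */
void reference_gemm(const float *A, const float *B, float *C,
                    int M, int N, int K, float beta)
{
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = beta * C[m * N + n] + acc;
        }
    }
}
```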

Facile Interface

Data and parameters are passed directly, without wrapper structures. Matrices and arrays are passed as a base address plus dimensions. GEMM parameters that are rarely needed in on-device inference, such as the leading dimensions LDA-LDC, are excluded from the interface. There are no dependencies on third-party compute libraries.
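
As an illustration of how a call looks, here is a minimal sketch. The function name sgemm and its exact signature are assumptions inferred from the data_type + "gemm" naming rule and the parameter list documented on this page; consult Usage_EN.md for the real declaration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed prototype (see the note above): matrix orders, base addresses,
 * M, N, K, beta, number of threads. */
int sgemm(int a_rowmajor, int b_rowmajor,
          const float *A, const float *B, float *C,
          uint32_t M, uint32_t N, uint32_t K,
          float beta, uint32_t num_threads);

int main(void)
{
    const uint32_t M = 8, K = 256, N = 512;            /* a typical "skinny" on-device shape */
    float *A = calloc((size_t)M * K, sizeof(float));   /* source matrix A, row-major */
    float *B = calloc((size_t)K * N, sizeof(float));   /* source matrix B, row-major */
    float *C = calloc((size_t)M * N, sizeof(float));   /* output matrix C */

    /* C = A * B; beta = 0 discards any previous contents of C; 1 thread. */
    sgemm(1, 1, A, B, C, M, N, K, 0.0f, 1);

    free(A); free(B); free(C);
    return 0;
}
```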

Extensibility

EMLL abstracts the core structures of CPU-based high-performance matrix-multiplication algorithms, as well as the bias and quantization functions, into general macros (see the files under include/common) that can be applied to a variety of processors. When porting to a new architecture, much of the coding work can be saved by reusing these macros.
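
The idea can be illustrated with a toy macro. The macro below is hypothetical and far simpler than EMLL's real ones (which also generate packing, blocking and per-CPU kernel dispatch); it only shows how one generic definition is instantiated for several data types, much as a new architecture port would instantiate the shared templates under include/common.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified macro: generates a naive GEMM for a given pair of
 * input/accumulator types. EMLL's actual macros are far more elaborate. */
#define DEFINE_NAIVE_GEMM(suffix, in_t, acc_t)                         \
    static void gemm_##suffix(const in_t *A, const in_t *B, acc_t *C,  \
                              int M, int N, int K)                     \
    {                                                                  \
        for (int m = 0; m < M; ++m)                                    \
            for (int n = 0; n < N; ++n) {                              \
                acc_t acc = 0;                                         \
                for (int k = 0; k < K; ++k)                            \
                    acc += (acc_t)A[m * K + k] * (acc_t)B[k * N + n];  \
                C[m * N + n] = acc;                                    \
            }                                                          \
    }

/* One definition, several data-type instantiations, mirroring the
 * fp32 and int8 coverage listed in the tables below. */
DEFINE_NAIVE_GEMM(fp32, float, float)
DEFINE_NAIVE_GEMM(s8s32, int8_t, int32_t)

int main(void)
{
    float a[2] = {1.0f, 2.0f}, b[2] = {3.0f, 4.0f}, c[1];
    gemm_fp32(a, b, c, 1, 1, 2);            /* (1x2) * (2x1) -> (1x1) */
    printf("%f\n", c[0]);                   /* prints 11.000000 */

    int8_t qa[2] = {1, 2}, qb[2] = {3, 4};
    int32_t qc[1];
    gemm_s8s32(qa, qb, qc, 1, 1, 2);
    printf("%d\n", (int)qc[0]);             /* prints 11 */
    return 0;
}
```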

EMLL APIs

EMLL provides a series of C functions. See Usage_EN.md for details.

Type | Name | Parameters
--- | --- | ---
Matrix multiplication | data_type + "gemm" | matrix orders, addresses of matrices, M, N, K, beta, number of threads
Fully-connected layer (fp32) | "fc" | addresses of src/weight/bias/output, dimensions M/K/N, orders of source matrices, (number of threads)
Quantization | "quantize_" + "symmetric"/"asymmetric" + input_type + output_type | input array, output array, (zero point), scale, size of array, input range
Requantization | "requantize_" + "symmetric"/"asymmetric" + "_XtoY" | input array, output array, (zero point), output scale, size of array, input range
Bias | "bias" + data_type | the matrix to be biased, scalar bias applied to all elements, vector bias along the major direction, vector bias along the minor direction, dimensions of the matrix

Supported Architectures and Data Types

Target CPU | Matrix Multiplication | Bias | Quantization | Requantization
--- | --- | --- | --- | ---
ARMv7a 32-bit | fp32 -> fp32, (u)int8 -> (u)int32 | fp32, int32 | fp32 -> (u)int8/(u)int16 | int32 -> (u)int8/(u)int16, int16 -> (u)int8
ARMv8a 64-bit | fp32 -> fp32, (u)int8 -> (u)int32, fp16 -> fp16 | fp32, fp16, int32 | fp32 -> (u)int8/(u)int16 | int32 -> (u)int8/(u)int16, int16 -> (u)int8

Supported OS: Linux & Android

Supported Compilers: GCC & Clang

Future Plan

Depending on business requirements, EMLL may support on-device GPUs and NPUs and expand its set of available functions in the future.

License

Apache 2.0

Reference

Eigen: https://eigen.tuxfamily.org

OpenBLAS: https://github.com/xianyi/OpenBLAS

Issues
  • How to solve the EMLL cross-compilation error below?

    /EMLL/src/arm_neon/ARMCompareAndSwap.c:1:0: error: invalid feature modifier in '-march=armv8.2-a+dotprod+fp16' /*****************************************************************************/
    CMakeFiles/eml-armneon.dir/build.make:62: recipe for target 'CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCompareAndSwap.c.o' failed
    make[2]: *** [CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCompareAndSwap.c.o] Error 1
    CMakeFiles/Makefile2:109: recipe for target 'CMakeFiles/eml-armneon.dir/all' failed
    make[1]: *** [CMakeFiles/eml-armneon.dir/all] Error 2
    Makefile:129: recipe for target 'all' failed
    make: *** [all] Error 2

    opened by xuhaoguang 18
  • Slower than Arm Compute Library on the A53 platform

    I tried the following pipeline on Android aarch64 (Cortex-A53): quantize_symmetric_f32_s8, then s8s32gemm, requantize_symmetric_32to8, s8s32gemm, requantize_symmetric_32to8, ..., looping over several layers of s8s32gemm and requantize_symmetric_32to8, then dequantize_symmetric_f32_s32 at the end. The matrices are (m x n) = (m x k) x (k x n), with shapes roughly m=8, k=100, n=400 and m=8, k=400, n=100. The runtime is about twice that of ACL. On the A76 platform several matrix shapes are a bit faster than ACL, which is great. I see that A35, A53 and A7x have different code optimizations.

    opened by TPcoding 8
  • On the correspondence with GEMM in BLAS

    Thank you very much for sharing this work. When replacing OpenBLAS with EMLL, I found that the parameters of the two differ somewhat. Could you give some advice on how to migrate the parameters, or provide a simple README? That would also help promote EMLL.

    The main difference seems to be that EMLL has no Transpose option; the other parameters should have the same meaning? (A mapping sketch follows this item.)

    EMLL:
    a_rowmajor | layout of source matrix A, non-zero means row-major
    b_rowmajor | layout of source matrix B, non-zero means row-major
    A | address of source matrix A
    B | address of source matrix B
    C | address of output matrix C
    M | number of rows of matrix A
    N | number of columns of matrix B
    K | number of columns of A; must equal the number of rows of B
    beta | pre-multiplying factor applied to matrix C
    num_threads | number of threads available for parallel execution

    OpenBLAS:
    int an = a->dimSize[0]; int am = a->dimSize[1];
    int bn = b->dimSize[0]; int bm = b->dimSize[1];
    int cn = c->dimSize[0]; int cm = c->dimSize[1];
    GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, cn, cm, am, alpha, (DTYPE*)a->data, am, (DTYPE*)b->data, bm, beta, (DTYPE*)c->data, cm)

    opened by bcmi220 6
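
    A rough mapping sketch for the question above, assuming the fp32 entry point is named sgemm and takes the parameters quoted in this issue (both are assumptions; check Usage_EN.md): the row-major, no-transpose, alpha == 1.0f case of cblas_sgemm then corresponds to the call below.

    ```c
    #include <stdint.h>

    /* Assumed EMLL prototype, built from the parameter list quoted above. */
    int sgemm(int a_rowmajor, int b_rowmajor,
              const float *A, const float *B, float *C,
              uint32_t M, uint32_t N, uint32_t K,
              float beta, uint32_t num_threads);

    /* Hypothetical migration helper covering only densely stored matrices
     * (lda == K, ldb == N, ldc == N), CblasRowMajor, CblasNoTrans and
     * alpha == 1.0f, since the EMLL interface has no alpha or LDA-LDC. */
    void sgemm_like_cblas(int M, int N, int K,
                          const float *A, const float *B,
                          float beta, float *C, int num_threads)
    {
        /* Equivalent OpenBLAS call:
         * cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
         *             M, N, K, 1.0f, A, K, B, N, beta, C, N); */
        sgemm(1, 1, A, B, C,
              (uint32_t)M, (uint32_t)N, (uint32_t)K, beta, (uint32_t)num_threads);
    }
    ```

    A CblasTrans operand can usually be handled by flipping the corresponding *_rowmajor flag, because a transposed row-major matrix has exactly the memory layout of the untransposed matrix in column-major order.
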
  • Using EMLL from multiple threads increases memory usage

    A process creates four threads: thread-1 runs EMLL's gemm (computing with 1 thread, i.e. the last gemm argument is 1); thread-2, thread-3 and thread-4 just sleep. This uses more memory than a process that creates only one thread running the same gemm (again with 1 compute thread). The extra memory comes from EMLL: by my measurements, each additional thread adds about 768 KB. Is this related to GEMM_STATIC_BUFFER in CommonDriver.h, which allocates exactly 768 KB (1024 x 192 x 4 / 1024)? How can I solve this? My threads 2/3/4 do not need EMLL's gemm at all; how can I avoid the extra memory usage?

    opened by Chen1399 5
  • How to selectively enable individual GEMM optimizations when compiling EMLL?

    According to EMLL's introduction, GEMM is optimized mainly through three techniques: blocking, repacking and assembly optimization. How can I selectively enable only one or a few of them?

    I ask because, in my own tests, the same data works fine in a standalone demo on the device, but after integrating it into my project it crashes (core dump) on the device. Analysis suggests the crash is most likely caused by insufficient memory, so I suspect that some of the three GEMM optimizations may be particularly memory-hungry.

    opened by xuhaoguang 1
Releases

v1.0

Owner

NetEase Youdao