Edge ML Library - High-performance Compute Library for On-device Machine Learning Inference





Edge ML Library (EMLL) offers optimized basic routines, such as general matrix multiplication (GEMM) and quantization, to speed up machine learning (ML) inference on ARM-based devices. EMLL supports the fp32, fp16 and int8 data types. It accelerates the on-device NMT, ASR and OCR engines of Youdao, Inc.


Performance-Oriented Design

The matrix-multiplication routines are heavily optimized for the matrix shapes common in on-device ML tasks, including "skinny" ones. The matrix-multiplication kernels are tuned for specific CPUs, and a large portion of their code is inline assembly.

Here are benchmarks of SGEMM on two machines[1]:

[Benchmark figures: test1 (armv8a cortex-A35, 4-thread) and test2 (armv8a cortex-A53, 4-thread)]

[1] The formula of GEMM: C[MxN] = A[MxK] B[KxN]. For each test case, the better of the all-row-major and all-column-major results is reported.

Facile Interface

Data and parameters are passed directly, without wrapper structures. Matrices and arrays are passed as a base address plus dimensions. GEMM parameters that are seldom used in on-device inference, such as LDA-LDC, are excluded from the interface. There is no dependency on any third-party compute library.
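
To illustrate the calling style described above, here is a minimal, unoptimized reference GEMM in C. It is a sketch, not EMLL code: the name `ref_sgemm` and the helper `ref_at` are hypothetical, the thread-count argument is omitted, and the output matrix C is assumed to be stored in column-major order. Check Usage_EN.md for the real signatures and storage conventions.

```c
#include <stddef.h>

/* Hypothetical helper (not part of EMLL): read element (r, c) of a
 * rows x cols matrix stored densely in row-major or column-major order. */
static float ref_at(const float *m, int rows, int cols,
                    int rowmajor, int r, int c) {
    return rowmajor ? m[(size_t)r * cols + c] : m[(size_t)c * rows + r];
}

/* Hypothetical reference routine (not part of EMLL):
 * C[MxN] = A[MxK] * B[KxN] + beta * C[MxN].
 * Matrices are passed as base address + dimensions, with no leading
 * dimension (LDA/LDB/LDC) arguments, so each one must be densely packed.
 * Assumption: C is stored in column-major order. */
void ref_sgemm(int a_rowmajor, int b_rowmajor,
               const float *A, const float *B, float *C,
               int M, int N, int K, float beta) {
    for (int n = 0; n < N; ++n) {
        for (int m = 0; m < M; ++m) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += ref_at(A, M, K, a_rowmajor, m, k) *
                       ref_at(B, K, N, b_rowmajor, k, n);
            }
            float *c = &C[(size_t)n * M + m];
            /* With beta == 0 the old contents of C are ignored entirely,
             * so an uninitialized C cannot leak NaNs into the result. */
            *c = (beta == 0.0f) ? acc : acc + beta * (*c);
        }
    }
}
```

A routine like this is mainly useful as a correctness baseline when comparing against an optimized kernel on small matrices.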


EMLL abstracts the core structures of CPU-based high-performance matrix multiplication algorithms, as well as the bias and quantization functions, into general macros (see the files under include/common) that can be applied to a variety of processors. When developing for a new architecture, these macros save a lot of coding work.


EMLL provides a series of C functions. See Usage_EN.md for details.

Type | Name | Parameters
-- | -- | --
Matrix Multiplication | data_type + "gemm" | matrix_orders, addresses of matrices, M, N, K, beta, number of threads
Fully-connect Layer (fp32) | "fc" | addresses of src/weight/bias/output, dimensions M/K/N, orders of source matrices, (number of threads)
Quantization | "quantize_" + "symmetric"/"asymmetric" + input_type + output_type | input array, output array, (zero point), scale, size of array, input range
Requantization | "requantize_" + "symmetric/asymmetric" + "_XtoY" | input array, output array, (zero point), output scale, size of array, input range
Bias | "bias" + data_type | the matrix to be biased, scalar bias to all elements, vector bias along major direction, vector bias along minor direction, dimensions of the matrix
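
As a rough illustration of what a symmetric fp32 -> int8 quantization routine computes, here is a minimal C sketch. It is not EMLL code: the function name `ref_quantize_symmetric_f32_s8` is hypothetical, it derives the input range by scanning the array instead of taking it as a parameter, and it assumes the convention scale = max|x| / 127 with real_value ≈ quantized_value * scale. EMLL's own parameter order and scale convention are documented in Usage_EN.md.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not part of EMLL).
 * Symmetric quantization: the zero point is fixed at 0 and one scale
 * maps the range [-max_abs, max_abs] onto [-127, 127].
 * Returns the scale, so that input[i] is approximately output[i] * scale. */
float ref_quantize_symmetric_f32_s8(const float *input, int8_t *output,
                                    size_t size) {
    /* Find the largest absolute value in the input. */
    float max_abs = 0.0f;
    for (size_t i = 0; i < size; ++i) {
        float a = input[i] < 0.0f ? -input[i] : input[i];
        if (a > max_abs) max_abs = a;
    }

    /* All-zero input: any scale works; return 1.0 to avoid dividing by 0. */
    if (max_abs == 0.0f) {
        for (size_t i = 0; i < size; ++i) output[i] = 0;
        return 1.0f;
    }

    float scale = max_abs / 127.0f;
    for (size_t i = 0; i < size; ++i) {
        float q = input[i] / scale;
        /* Round half away from zero without needing libm. */
        int r = (int)(q + (q >= 0.0f ? 0.5f : -0.5f));
        /* Clamp to the symmetric int8 range. */
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        output[i] = (int8_t)r;
    }
    return scale;
}
```

Multiplying each int8 output by the returned scale recovers an approximation of the original value; the asymmetric variants listed in the table additionally carry a zero point.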

Supported Architectures and Data Types

Target CPU | Matrix Multiplication | Bias | Quantization | Requantization
-- | -- | -- | -- | --
ARMv7a 32-bit | fp32 -> fp32, (u)int8 -> (u)int32 | fp32, int32 | fp32 -> (u)int8/(u)int16 | int32 -> (u)int8/(u)int16, int16 -> (u)int8
ARMv8a 64-bit | fp32 -> fp32, (u)int8 -> (u)int32, fp16 -> fp16 | fp32, fp16, int32 | fp32 -> (u)int8/(u)int16 | int32 -> (u)int8/(u)int16, int16 -> (u)int8

Supported OS: Linux & Android

Supported Compilers: GCC & Clang

Future Plan

Depending on business requirements, EMLL may support on-device GPUs and NPUs in the future and expand its set of available functions.


Apache 2.0


Eigen: [https://eigen.tuxfamily.org]

OpenBLAS: [https://github.com/xianyi/OpenBLAS]

  • How to solve the EMLL cross-compilation problem below?

    /EMLL/src/arm_neon/ARMCompareAndSwap.c:1:0: error: invalid feature modifier in '-march=armv8.2-a+dotprod+fp16'
    /*****************************************************************************/

    CMakeFiles/eml-armneon.dir/build.make:62: recipe for target 'CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCompareAndSwap.c.o' failed
    make[2]: *** [CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCompareAndSwap.c.o] Error 1
    CMakeFiles/Makefile2:109: recipe for target 'CMakeFiles/eml-armneon.dir/all' failed
    make[1]: *** [CMakeFiles/eml-armneon.dir/all] Error 2
    Makefile:129: recipe for target 'all' failed
    make: *** [all] Error 2

    opened by xuhaoguang 18
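  • Slower than Arm Compute Library on the A53 platform

    I tried the following pipeline on Android aarch64 A53: quantize_symmetric_f32_s8, s8s32gemm, requantize_symmetric_32to8, s8s32gemm, requantize_symmetric_32to8, ... (s8s32gemm and requantize_symmetric_32to8 repeated like this for several layers), then dequantize_symmetric_f32_s32. The matrices are m*n = (m*k) x (k*n), with shapes roughly like m=8, k=100, n=400 and m=8, k=400, n=100. EMLL takes roughly twice as long as ACL. On the A76 platform, EMLL is a bit faster than ACL for several of the matrices, which is great. I see that there are different code optimizations for A35, A53 and A7x.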

    opened by TPcoding 8
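  • On the correspondence with GEMM in BLAS

    The main difference seems to be that EMLL has no Transpose parameter. The other parameters should have the same meaning?

    EMLL:

    Parameter | Description
    -- | --
    a_rowmajor | storage order of source matrix A; non-zero means row-major
    b_rowmajor | storage order of source matrix B; non-zero means row-major
    A | address of source matrix A
    B | address of source matrix B
    C | address of output matrix C
    M | number of rows of matrix A
    N | number of columns of matrix B
    K | number of columns of A; must equal the number of rows of B
    beta | pre-multiplication factor applied to matrix C
    num_threads | number of threads available for parallel execution

    OpenBLAS:

        int an = a->dimSize[0];
        int am = a->dimSize[1];
        int bn = b->dimSize[0];
        int bm = b->dimSize[1];
        int cn = c->dimSize[0];
        int cm = c->dimSize[1];
        GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, cn, cm, am, alpha, (DTYPE*)a->data, am, (DTYPE*)b->data, bm, beta, (DTYPE*)c->data, cm)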

    opened by bcmi220 6
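  • Using EMLL from a multi-threaded process increases memory usage

    A process that creates four threads:

      thread-1: runs EMLL's gemm (computing with 1 thread, i.e. the last argument of gemm is 1)
      thread-2: sleep
      thread-3: sleep
      thread-4: sleep

    uses more memory than a process that creates only one thread:

      thread-1: runs EMLL's gemm (computing with 1 thread, i.e. the last argument of gemm is 1)

    The extra memory comes from EMLL. By my measurements, each additional thread adds about 768 KB of memory. Is this related to GEMM_STATIC_BUFFER in CommonDriver.h? That buffer also happens to be exactly 768 KB (1024x192x4/1024). How can I solve this? My threads 2/3/4 do not need EMLL's gemm, so how can I avoid this extra memory usage?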

    opened by Chen1399 5
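  • How to selectively enable particular GEMM optimizations when compiling EMLL?


    According to the EMLL introduction, GEMM is optimized mainly through three techniques: blocking, data reordering and assembly optimization. How can I selectively enable only one or a few of them?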


    opened by xuhaoguang 1