GPTPU: General-Purpose Computing on (Edge) Tensor Processing Units

Overview

Welcome to the repository of ESCAL @ UCR's GPTPU project! We aim to demonstrate the power of matrix processing units (MXUs), which are now ubiquitous across all types of computing platforms. This project targets Google's Edge TPU -- a "relatively" open architecture that anyone can purchase and integrate into their systems. In our preliminary results, we achieve a 2.46x speedup over a single high-end CPU core. For more information, see our arXiv paper https://arxiv.org/pdf/2107.05473.pdf or the upcoming SC21 paper.

Hardware installation

You will need an M.2 version of the Edge TPU (recommended) https://coral.ai/docs/m2/get-started/ or a USB Edge TPU accelerator installed in your system.

Once you have the Edge TPUs, please follow Google's documentation to install their drivers and toolchains before installing our GPTPU framework. https://coral.ai/docs/m2/get-started/#2-install-the-pcie-driver-and-edge-tpu-runtime

You may also refer to Section 3.1 of our arXiv paper to build a multi-Edge-TPU machine (much cheaper) or purchase ASUS's 8x Edge TPU PCIe card https://iot.asus.com/products/AI-accelerator/AI-Accelerator-PCIe-Card/

Install GPTPU library (Our contribution)

Compile all benchmarks

$ make 

Run all benchmarks

$ make run
// each benchmark reports its RMSE and error rate as mentioned in the paper. Some benchmarks involve experimental features.

The gptpu library is pre-compiled as libgptpu.so and linked by the Makefile.

Compile the gptpu library

// run Makefile_gptpu; this requires sudo permission
// sc21 is simply a demo account without sudo permission

Prerequisites

tensorflow 1.13.1 // Python-based model creation generates the template on first run if it does not exist
bazel 2.0.0
cnpy (https://github.com/rogersce/cnpy)
cmake
python3
numpy
apex driver
gasket driver
cblas (for comparison only)

$ sudo apt-get install libopenblas-dev

Set PATH

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Note about GEMM

GEMM is fundamental, and it is our very first benchmark. It also includes an exact mode as an experimental feature. The exact mode is still a work in progress; it adopts a blocking algorithm with a block size of 256 to avoid uint8_t overflow. In this demo, we show the floating-point approximation result, with the small RMSE and error rate mentioned in the paper.
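To illustrate the blocking idea (this is a minimal CPU sketch, not GPTPU's actual Edge TPU implementation), the matrices can be tiled into 256x256 blocks whose partial products are accumulated in a wide int32 buffer, so no intermediate sum is ever forced back into the narrow uint8_t type:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Blocked uint8 GEMM sketch: C = A * B for n x n matrices.
// Partial sums are accumulated in int32_t, so per-block results never
// overflow uint8_t. BLOCK = 256 mirrors the block size mentioned above;
// the real exact mode dispatches these blocks to Edge TPUs rather than
// looping on the CPU.
constexpr int BLOCK = 256;

void blocked_gemm(const std::vector<uint8_t>& A,
                  const std::vector<uint8_t>& B,
                  std::vector<int32_t>& C, int n) {
  for (int ii = 0; ii < n; ii += BLOCK)
    for (int kk = 0; kk < n; kk += BLOCK)
      for (int jj = 0; jj < n; jj += BLOCK)
        // Multiply one BLOCK x BLOCK tile pair and accumulate into C.
        for (int i = ii; i < std::min(ii + BLOCK, n); ++i)
          for (int k = kk; k < std::min(kk + BLOCK, n); ++k) {
            const int32_t a = A[i * n + k];
            for (int j = jj; j < std::min(jj + BLOCK, n); ++j)
              C[i * n + j] += a * B[k * n + j];
          }
}
```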

Multi-TPU scheme

The GPTPU library allows enabling multiple TPUs for parallel computing. The following device initialization API

open_devices(int opening_order, int wanted_dev_cnt)

has two arguments opening_order and wanted_dev_cnt.

  1. opening_order: 0: open device(s) sequentially starting from the first device (index 0). 1: open device(s) sequentially starting from a randomly chosen device. (You can extend this argument with more opening policies.)
  2. wanted_dev_cnt: the number of devices you want to open (constrained by the maximum number of devices available).
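A minimal usage sketch of this API is shown below. Note that the header name `gptpu.h` is an assumption, and the code is illustrative only -- it needs libgptpu.so and attached Edge TPUs to actually run:

```cpp
// Hypothetical header name; link against libgptpu.so as the Makefile does.
#include <gptpu.h>

int main() {
  // Policy 0: open devices sequentially starting from index 0.
  // Request 4 devices; the library caps this at the number available.
  open_devices(/*opening_order=*/0, /*wanted_dev_cnt=*/4);

  // ... dispatch GEMM or other kernels across the opened TPUs ...
  return 0;
}
```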

openctpu usage

Please refer to the example source code ./src/openctpu.cc for details.
