A profiler to disclose and quantify hardware features on GPUs.

Overview

ArchProbe

ArchProbe is a profiling tool that demystifies mobile GPU architectures in great detail. The mechanism behind ArchProbe is described in our technical paper, which is still under review.

Adreno & Mali Architecture Overview
Architecture details collected with ArchProbe, presented in our technical paper.

How to Use

In a clone of the ArchProbe code repository, the following commands build ArchProbe for most mobile devices with a 64-bit ARMv8 architecture.

git submodule update --init --recursive
mkdir build-android-aarch64 && cd build-android-aarch64
cmake -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK/build/cmake/android.toolchain.cmake" -DANDROID_ABI="arm64-v8a" -DANDROID_PLATFORM=android-28 -G "Ninja" ..
cmake --build . -t ArchProbe

To run ArchProbe from the command line via adb shell, you need to copy the executables to /data/local/tmp.
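The copy-and-run step can be scripted. The sketch below is only an illustration under assumptions: it presumes adb is on the PATH, that a single device is connected, and that the built executable is named ArchProbe. The local path used in the example call is a guess rather than something this README states, and `deploy_commands` is a hypothetical helper name. The function only builds the command lines; running them is left to the caller.

```python
import shlex

REMOTE_DIR = "/data/local/tmp"  # target directory named in this README


def deploy_commands(local_binary, remote_dir=REMOTE_DIR, verbose=True):
    """Return adb command lines that push the binary and run it on the device.

    Nothing is executed here; each entry is an argv list suitable for
    subprocess.run(cmd, check=True).
    """
    remote_binary = f"{remote_dir}/ArchProbe"
    run = "./ArchProbe -v" if verbose else "./ArchProbe"
    return [
        ["adb", "push", local_binary, remote_binary],
        ["adb", "shell", "chmod", "755", remote_binary],
        # Run from remote_dir so that relative file names resolve there.
        ["adb", "shell", f"cd {shlex.quote(remote_dir)} && {run}"],
    ]


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever your build puts the binary.
    for cmd in deploy_commands("build-android-aarch64/ArchProbe"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them before passing each list to `subprocess.run`.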

If you are using Windows, the PowerShell scripts in the scripts directory can be convenient too:

scripts/Run-Android.ps1 [-Verbose]

Prebuilt Binaries

Prebuilt binaries will be available here.

How to Interpret Outputs

GPU hardware has many traits, such as GFLOPS and cache size. ArchProbe implements a bag of tricks to expose these traits, and each implementation is called an aspect. Each aspect has its own configuration in ArchProbe.json, its report in ArchProbeReport.json, and a data table covering every run of its probing kernels in [ASPECT_NAME].csv. Currently, ArchProbe implements the following aspects:

  • WarpSizeMethod{A|B}: Two methods to detect the warp size of a GPU core;
  • GFLOPS: Peak computational throughput of the device;
  • RegCount: Number of registers available to a thread and whether the register file is shared among warps;
  • BufferVecWidth: Optimal vector width to read the most data in a single memory access;
  • {Image|Buffer}CachelineSize: Top-level cacheline size of image/buffer;
  • {Image|Buffer}Bandwidth: Peak read-only bandwidth of image/buffer;
  • {Image|Buffer}CacheHierarchyPChase: Size of each cache level of image/buffer, measured by the P-chase method.
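To look up a trait after a run, the report can be loaded with any JSON library. The following is a minimal sketch that assumes ArchProbeReport.json is a JSON object keyed by aspect name, with each aspect holding its reported entries as key/value pairs. That layout is inferred rather than documented here, the sample values are placeholders, and `lookup` is a hypothetical helper.

```python
import json


def lookup(report, aspect, key, default=None):
    """Return report[aspect][key], or default if either level is missing."""
    entries = report.get(aspect)
    if not isinstance(entries, dict):
        return default
    return entries.get(key, default)


# Stand-in for the file contents (assumed layout, placeholder values).
sample = json.loads("""
{
  "Device": {"SmCount": 2, "Done": true},
  "RegCount": {"RegCount": 183, "RegType": "Pooled", "Done": true}
}
""")

print(lookup(sample, "Device", "SmCount"))
print(lookup(sample, "RegCount", "RegType"))
# An aspect that has not been probed yet falls back to the default.
print(lookup(sample, "ImageBandwidth", "MaxBandwidth", "not probed yet"))
```

For a real run, replace `sample` with `json.load(open("ArchProbeReport.json"))` after pulling the file from the device.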

If the -v flag is given, ArchProbe prints extra human-readable logs to stdout, which are also a good source of information.
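These logs can also be mined mechanically. The sketch below assumes the log format visible in the issue transcript further down this page, where a line such as `[INFO] [RegCount]` opens an aspect and lines such as `[INFO]     reported 'RegType' = 'Pooled'` carry its results. The format is not a documented interface and may change, and `parse_log` is a hypothetical helper.

```python
import re

ASPECT_RE = re.compile(r"^\[INFO\] \[(\w+)\]\s*$")
REPORT_RE = re.compile(r"^\[INFO\]\s+reported '([^']*)' = '([^']*)'\s*$")


def parse_log(text):
    """Group "reported 'key' = 'value'" lines under the aspect that precedes them."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = ASPECT_RE.match(line)
        if m:
            current = m.group(1)
            results.setdefault(current, {})
            continue
        m = REPORT_RE.match(line)
        if m and current is not None:
            results[current][m.group(1)] = m.group(2)  # values stay strings
    return results


# Lines excerpted from the transcript quoted on this page.
log = """\
[INFO] [WarpSizeMethodB]
[INFO]     initialized table for aspect 'WarpSizeMethodB'
[INFO]     reported 'WarpThreadCount' = '64'
[INFO]     reported 'Done' = '1'
[INFO] [RegCount]
[INFO]     reported 'RegCount' = '183'
[INFO]     reported 'RegType' = 'Pooled'
"""

print(parse_log(log))
```

Values are kept as strings because the log quotes every value; convert them as needed once the meaning of a key is known.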

Experiment data gathered from Google Pixel 4 can be found here.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.


Comments
  • Mapping between values in GPU hierarchy images and output files

    Below are two images captured from the paper "Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs": image image

    (1) The maximum number of registers for each work item (2) The size, cacheline size, and bandwidth of memory hierarchy including all levels of unified and texture caches, local, constant and global memory (3) The number of threads in a warp (4) The number of ALUs in a shader core

    However, after generating the .json and .csv files, I cannot connect all of the reported values to the properties shown in the GPU hierarchy images. Could you please explain how they map, especially (2) and (4)?

    I have also found some connections and listed them below (taking the Adreno 640 GPU as an example). Could you please check whether they are right?

    • Shader cores count: the value of ArchProbeReport.json - [Device] - SmCount?
    • Execute engines count: where is this value?
    • ALUs count: where is this value?
    • Warp size: the value of ArchProbeReport.json - [WarpSizeMethod{A|B}] - WarpThreadCount?
    • What is 384, and where is this value?
    • Registers count: the value (is it 181?) of ArchProbeReport.json - [RegCount] - RegCount?
    • Register type (Pooled / Dedicated): the value of ArchProbeReport.json - [RegCount] - RegType?
    • Register bits: where is this value (4B)?
    • Texture L1 cache bandwidth: the value of ArchProbeReport.json - [ImageBandwidth] - MinBandwidth / MaxBandwidth?
    • Local memory bandwidth: the value of ArchProbeReport.json - [BufferBandwidth] - MinBandwidth / MaxBandwidth?
    • Unified L2 cache bandwidth: the value of ArchProbeReport.json - [BufferBandwidth] - MinBandwidth / MaxBandwidth?
    • Constant memory bandwidth: the value of ArchProbeReport.json - [BufferBandwidth] - MinBandwidth / MaxBandwidth?
    • Texture L1 cache size: where is this value?
    • Local memory size: where is this value?
    • Unified L2 cache size: where is this value?
    • Constant memory size: where is this value?
    • Global memory bandwidth: where is this value?
    question 
    opened by Smeegol 3
  • Does this contain the code to reproduce or generate the figures or data in the paper?

    opened by Smeegol 2
  • ArchProbe gets stuck when running ./ArchProbe on Adreno 640 on Meizu 16s

    Environment:

    Device: Meizu 16s
    Android version: 9
    SoC: Qualcomm Snapdragon 855
    GPU: Adreno 640
    

    I have already built ArchProbe following https://github.com/microsoft/ArchProbe#how-to-use and pushed it to /data/local/tmp/ArchProbe on the target device.

    However, when I run ./ArchProbe, it gets stuck after a while.

    $ adb shell
    16s:/ $ cd /data/local/tmp/ArchProbe
    16s:/data/local/tmp/ArchProbe $ ./ArchProbe
    [INFO] initialized opencl environment
    [INFO] selected device #0: QUALCOMM Snapdragon(TM) (FULL_PROFILE) - QUALCOMM Adreno(TM) (4, OpenCL 2.0 Adreno(TM) 640)
    [WARN] configuration file cannot be opened at 'ArchProbe.json', a default configuration will be created
    [WARN] report file cannot be opened at 'ArchProbeReport.json', a new report will be created
    [INFO] set-up testing environment
    [INFO] fetched device report
    [INFO]     (qualcomm extension) device page size is 4KB
    [INFO]     2GB global memory with 128KB cache consists of 64B cachelines
    [INFO]     images up to [16384, 16384] texels are supported
    [INFO]     2 SMs with 1024 logical threads in each
    [INFO] [Device]
    [WARN]     aspect report ('Device') is invalid, a new record is created
    [INFO]     reported 'SmCount' = '2'
    [INFO]     reported 'LogicThreadCount' = '1024'
    [INFO]     reported 'MaxBufferSize' = '2882611200'
    [INFO]     reported 'CacheSize' = '131072'
    [INFO]     reported 'CachelineSize' = '64'
    [INFO]     reported 'MaxImageWidth' = '16384'
    [INFO]     reported 'MaxImageHeight' = '16384'
    [INFO]     reported 'PageSize_QCOM' = '4096'
    [INFO]     reported 'Done' = '1'
    [INFO] already know that 'LogicThreadCount' from aspect 'Device' is 1024
    [INFO] [WarpSizeMethodA]
    [INFO]     initialized table for aspect 'WarpSizeMethodA'
    [WARN]     aspect report ('WarpSizeMethodA') is invalid, a new record is created
    [INFO]     reported 'WarpThreadCount' = '128'
    [INFO]     reported 'Done' = '1'
    [INFO]     saved data table to 'WarpSizeMethodA.csv'
    [INFO] [WarpSizeMethodB]
    [INFO]     initialized table for aspect 'WarpSizeMethodB'
    [WARN]     aspect configuration ('WarpSizeMethodB') is invalid, a new record is created
    [WARN]     record entry ('Compensate') is invalid, a new record is created
    [WARN]     record entry ('Threshold') is invalid, a new record is created
    [INFO]     found minimal niter=100 to take 1000us
    [WARN]     aspect report ('WarpSizeMethodB') is invalid, a new record is created
    [INFO]     reported 'WarpThreadCount' = '64'
    [INFO]     discovered the warp size being 64 by method b
    [INFO]     reported 'Done' = '1'
    [INFO]     saved data table to 'WarpSizeMethodB.csv'
    [INFO] [Gflops]
    [INFO]     initialized table for aspect 'Gflops'
    [INFO]     already know that 'LogicThreadCount' from aspect 'Device' is 1024
    [INFO]     already know that 'SmCount' from aspect 'Device' is 2
    [WARN]     aspect configuration ('Gflops') is invalid, a new record is created
    [WARN]     record entry ('Compensate') is invalid, a new record is created
    [WARN]     record entry ('Threshold') is invalid, a new record is created
    [INFO]     found minimal niter=100 to take 1000us
    [WARN]     aspect report ('Gflops') is invalid, a new record is created
    [INFO]     reported 'HalfArch' = 'SISD'
    [INFO]     reported 'HalfVecComponentCount' = '1'
    [INFO]     reported 'HalfGflops' = '890.102'
    [INFO]     found minimal niter=100 to take 1000us
    [INFO]     reported 'FloatArch' = 'SISD'
    [INFO]     reported 'FloatVecComponentCount' = '1'
    [INFO]     reported 'FloatGflops' = '889.921'
    [INFO]     reported 'Done' = '1'
    [INFO]     saved data table to 'Gflops.csv'
    [INFO] [RegCount]
    [INFO]     already know that 'Done' from aspect 'Device' is 1
    [INFO]     initialized table for aspect 'RegCount'
    [INFO]     already know that 'LogicThreadCount' from aspect 'Device' is 1024
    [WARN]     aspect configuration ('RegCount') is invalid, a new record is created
    [WARN]     record entry ('Compensate') is invalid, a new record is created
    [WARN]     record entry ('Threshold') is invalid, a new record is created
    [WARN]     record entry ('NRegMin') is invalid, a new record is created
    [WARN]     record entry ('NRegMax') is invalid, a new record is created
    [WARN]     record entry ('NRegStep') is invalid, a new record is created
    [WARN]     record entry ('NGrpMin') is invalid, a new record is created
    [WARN]     record entry ('NGrpMax') is invalid, a new record is created
    [WARN]     record entry ('NGrpStep') is invalid, a new record is created
    [INFO]     found minimal niter=21546 to take 1000us
    [WARN]     aspect report ('RegCount') is invalid, a new record is created
    [INFO]     testing register availability when only 1 thread is dispatched
    [INFO]     183 registers are available at most
    [INFO]     reported 'RegCount' = '183'
    [INFO]     using 183 registers can have 12 concurrent single-thread workgroups
    [INFO]     reported 'FullRegConcurWorkgroupCount' = '12'
    [INFO]     using 91 registers can have 24 concurrent single-thread workgroups
    [INFO]     reported 'HalfRegConcurWorkgroupCount' = '24'
    [INFO]     all physical threads in an sm share 183 registers
    [INFO]     reported 'RegType' = 'Pooled'
    [INFO]     reported 'Done' = '1'
    [INFO]     saved data table to 'RegCount.csv'
    [INFO] [BufferVecWidth]
    [INFO]     initialized table for aspect 'BufferVecWidth'
    [INFO]     already know that 'LogicThreadCount' from aspect 'Device' is 1024
    [WARN]     aspect configuration ('BufferVecWidth') is invalid, a new record is created
    [WARN]     record entry ('Compensate') is invalid, a new record is created
    [WARN]     record entry ('Threshold') is invalid, a new record is created
    [INFO]     found minimal niter=4229 to take 1000us
    [WARN]     aspect report ('BufferVecWidth') is invalid, a new record is created
    [INFO]     optimal buffer read size is 32B
    [INFO]     discovered the optimal vectorization for buffer access in 32b-words be 4
    [INFO]     reported 'BufferVecSize' = '4'
    [INFO]     reported 'Done' = '1'
    [INFO]     saved data table to 'BufferVecWidth.csv'
    [INFO] [ImageCachelineSize]
    [INFO]     initialized table for aspect 'ImageCachelineSize'
    [INFO]     already know that 'LogicThreadCount' from aspect 'Device' is 1024
    [INFO]     already know that 'MaxImageWidth' from aspect 'Device' is 16384
    [INFO]     already know that 'MaxImageHeight' from aspect 'Device' is 16384
    [WARN]     aspect configuration ('ImageCachelineSize') is invalid, a new record is created
    [WARN]     record entry ('Compensate') is invalid, a new record is created
    [WARN]     record entry ('Threshold') is invalid, a new record is created
    [WARN]     aspect report ('ImageCachelineSize') is invalid, a new record is created
    [INFO]     testing image cacheline size along dim=0
    [INFO]     found minimal niter=7974 to take 1000us
    [INFO]     can concurrently access 64px with minimal cost along dim=0
    [INFO]     reported 'ImgMinTimeConcurThreadCountX' = '64'
    [INFO]     testing image cacheline size along dim=1
    [INFO]     found minimal niter=7759 to take 1000us
    [INFO]     can concurrently access 32px with minimal cost along dim=1
    [INFO]     reported 'ImgMinTimeConcurThreadCountY' = '32'
    [INFO]     reported 'ImgCachelineSize' = '32'
    [INFO]     reported 'ImgCachelineDim' = 'X'
    [INFO]     reported 'Done' = '1'
    [INFO]     saved data table to 'ImageCachelineSize.csv'
    [INFO] [BufferCachelineSize]
    [INFO]     initialized table for aspect 'BufferCachelineSize'
    [INFO]     already know that 'LogicThreadCount' from aspect 'Device' is 1024
    [INFO]     already know that 'CacheSize' from aspect 'Device' is 131072
    [WARN]     aspect configuration ('BufferCachelineSize') is invalid, a new record is created
    [WARN]     record entry ('Compensate') is invalid, a new record is created
    [WARN]     record entry ('Threshold') is invalid, a new record is created
    [INFO]     found minimal niter=414 to take 1000us
    [WARN]     aspect report ('BufferCachelineSize') is invalid, a new record is created
    [INFO]     testing buffer cacheline size
    [INFO]     top level buffer cacheline size is 64B
    [INFO]     reported 'BufTopLevelCachelineSize' = '64'
    [INFO]     reported 'Done' = '1'
    [INFO]     saved data table to 'BufferCachelineSize.csv'
    [INFO] [ImageBandwidth]
    [INFO]     initialized table for aspect 'ImageBandwidth'
    [INFO]     already know that 'MaxImageWidth' from aspect 'Device' is 16384
    [INFO]     already know that 'LogicThreadCount' from aspect 'Device' is 1024
    [INFO]     already know that 'SmCount' from aspect 'Device' is 2
    [WARN]     aspect report ('ImageBandwidth') is invalid, a new record is created
    [INFO]     reported 'MaxBandwidth' = '190.326'
    [INFO]     reported 'MinBandwidth' = '67.9003'
    [INFO]     discovered image read bandwidth min=67.9003; max=190.326
    [INFO]     reported 'Done' = '1'
    [INFO]     saved data table to 'ImageBandwidth.csv'
    [INFO] [BufferBandwidth]
    [INFO]     initialized table for aspect 'BufferBandwidth'
    [INFO]     already know that 'LogicThreadCount' from aspect 'Device' is 1024
    [INFO]     already know that 'SmCount' from aspect 'Device' is 2
    
    bug 
    opened by Smeegol 2
Releases(v0.0.1)
Owner
Microsoft
Open source projects and samples from Microsoft