BladeDISC - BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.


BladeDISC Introduction


BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads and one of the key components of Alibaba's PAI-Blade. BladeDISC provides general, transparent, and easy-to-use performance optimization for TensorFlow/PyTorch workloads on GPGPU and CPU backends. The architecture natively supports dynamic shape workloads and pays close attention to performance in both static and dynamic shape scenarios. It also supports multiple flexible deployment solutions, including Plugin Mode inside the TensorFlow/PyTorch runtime and Standalone Mode for AOT standalone execution. The project is based on MLIR and is closely related to the mlir-hlo project.

Refer to our website for more information, including the setup tutorial, developer guide, demo examples and documents for developers.

Features and Roadmap

Frontend Framework Support Matrix

| | TensorFlow [1] | PyTorch [2] |
| -- | -- | -- |
| Inference | Yes | Yes |
| Training | Yes [3] | Ongoing |

[1] TensorFlow 1.12, 1.15, 2.4 and 2.5 are supported and fully verified. Other versions might need some minor adaptation work.

[2] 1.6.0 <= PyTorch version < 1.9.0 has been fully verified.

[3] Although supported, there's much room for improvement on Op coverage for training workloads.

Backend Support Matrix

| | Memory Intensive Part | Compute Intensive Part | End-to-End Usability |
| -- | -- | -- | -- |
| Nvidia GPU | Yes | Yes | Yes |
| AMD GPU | Ongoing | Ongoing | No |
| Hygon DCU | Yes | Yes | Yes |
| X86 | Yes | Not open-sourced yet [1] | No |

[1] The compute-intensive part of the X86 backend is already supported in the internal version. The code decoupling is ongoing and the code will be open-sourced soon; the same applies to end-to-end usability.

Deployment Solutions

  • Plugin Mode - BladeDISC works as a plugin of TensorFlow or PyTorch. Only the supported Ops are clustered and compiled, and the unsupported ones will be executed by the original TensorFlow or PyTorch runtime. We recommend this mode to most of the users for its transparency and ease of use.

  • Standalone Mode - In Standalone Mode, the input workload is compiled into a binary that can be executed by itself, i.e., it does not rely on a TensorFlow or PyTorch runtime. In this mode all ops must be supported.

Numbers of Typical Workloads

Evaluated on a set of typical production machine learning workloads, BladeDISC shows up to a 3x speedup compared with TensorFlow/PyTorch.


Advantage in Dynamic Shape Workloads

Specifically, for the BERT large inference on T4 we provide in the examples, static compiler optimization (XLA) shows severe performance degradation due to its compilation overhead, while DISC shows a 1.75x speedup.

| TensorFlow | XLA | DISC |
| -- | -- | -- |
| 1.78 s | 41.69 s | 1.02 s |
| 1X | | 1.75X |
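
The speedup figures follow from dividing the TensorFlow baseline latency by each backend's latency. A minimal sketch of that arithmetic, using only the latencies reported above (the variable names are illustrative):

```python
# Latencies in seconds, copied from the BERT large numbers above.
latency_s = {"TensorFlow": 1.78, "XLA": 41.69, "DISC": 1.02}

baseline = latency_s["TensorFlow"]

# Speedup relative to the TensorFlow baseline: baseline latency / latency.
speedup = {name: baseline / t for name, t in latency_s.items()}

for name, s in speedup.items():
    print(f"{name}: {s:.2f}X")
```

A value below 1X, as for XLA here, means the run was slower than the baseline.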

API QuickView

For TensorFlow Users

Only two lines of code need to be added to a native TensorFlow program:

import numpy as np
import tensorflow as tf

## enable BladeDISC on TensorFlow program
import tensorflow_blade_disc as disc
disc.enable()

## construct TensorFlow Graph and run it
g = tf.Graph()
with g.as_default():
    ...  # build the model graph here
    # use tf.compat.v1.Session() on TensorFlow 2.x
    with tf.Session() as sess:
        sess.run(...)  # run the graph as usual

For more information, please refer to QuickStart for TensorFlow Users

For PyTorch Users

PyTorch users only need the following few lines of code to enable BladeDISC:

import torch
import torch.nn as nn
import torch_blade

# construct PyTorch Module
class MyModule(nn.Module):
    ...  # define the layers and forward() here

module = MyModule()

with torch.no_grad():
    # blade_module is the optimized module by BladeDISC
    # x and y are example input tensors for the module
    blade_module = torch_blade.optimize(module, allow_tracing=True, model_inputs=(x, y))

# run the optimized module
blade_module(x, y)

torch_blade.optimize accepts an nn.Module object and outputs the optimized module. For more information, please refer to Quickstart for PyTorch Users.

Setup and Examples


Tutorials and Documents for Developers

How to Contribute


Roadmap with mlir-hlo Project

BladeDISC has a close relationship with the mlir-hlo project. Some of the building blocks, including the MHLO Op definitions, TF to MHLO conversions, and some general-purpose passes, have been upstreamed to the mlir-hlo repository. We will continue to cooperate closely with the mlir-hlo project in the long term.

Contact Us


  • Install from source without docker

    Hi, we are trying to use spack to build and install BladeDISC without docker; however, we are facing some problems.

    1. Spack installs tensorflow from source, and protobuf is installed separately, so the detection logic in FindTensorflow.cmake is broken.
    2. Bazel version in the bundled tensorflow is >4.2.2, while the supported tf2.4/2.5 requires bazel3.7.2

    Can you give us some instructions on how to build BladeDISC outside docker?

    opened by asesidaa 12
  • Hello, there are some errors~

    subprocess.CalledProcessError: Command 'set -e; set -o pipefail; source .bazel_pyenv/bin/activate; bazel test --action_env PYTHON_BIN_PATH=/usr/bin/python3 --action_env BAZEL_LINKLIBS=-lstdc++ --action_env CC=/usr/bin/gcc --action_env CXX=/usr/bin/g++ --copt=-DPYTORCH_VERSION_STRING="1.7.1+cu110" --copt=-DPYTORCH_MAJOR_VERSION=1 --copt=-DPYTORCH_MINOR_VERSION=7 --copt=-DTORCH_BLADE_CUDA_VERSION=11.0 --action_env TORCH_BLADE_TORCH_INSTALL_PATH=/usr/local/lib/python3.6/dist-packages/torch --config=torch_debug --config=torch_tensorrt --action_env TENSORRT_INSTALL_PATH=/usr/local/TensorRT/ --action_env NVCC=/usr/local/cuda//bin/nvcc --config=torch_cxx11abi_0 --config=torch_cuda //tests/mhlo/... //src:torch_blade_test_suite' returned non-zero exit status 2.

    opened by eAzure 10
  • Support fusing "isSplat" constants

    The problem is observed in swin-transformer when PyTorch runs with AMP (automatic mixed precision).

    "lmhlo.constant"(%275) {value = dense<0.000000e+00> : tensor<64x784x768xf32>} : (memref<64x784x768xf32, "gpu">) -> ()
    "lmhlo.fusion"() ( {
          "lmhlo.multiply"(%1741, %275, %1742) {disc.device = "gpu"} : (memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">) -> ()
          "lmhlo.terminator"() : () -> ()
     }) {disc.device = "gpu", = "main_kLoop_reshape__37_1_2" : () -> ()
    "lmhlo.fusion"() ( {
          "lmhlo.multiply"(%1888, %275, %1889) {disc.device = "gpu"} : (memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">) -> ()
          "lmhlo.terminator"() : () -> ()
     }) {disc.device = "gpu", = "main_kLoop_reshape__37_1_3" : () -> ()

    In general, the "splat" constants outside of a fusion kernel might cause severe performance issues. In swin-transformer, the performance degradation can be very severe. Please be aware that there might be multiple kernels consuming the splat constant.

    Solution 1: mark "splat" constants as fusible in the fusion pass, and add an additional fusion stage that allows duplicating the producer according to a set of rules, like the FusionMerger in XLA.

    Solution 2: add an additional FuseSplatConstPass after the regular fusion pass that specifically duplicates and fuses the splat constant into the fusion kernels.

    Both solutions also need fusion codegen support for "splat" constants. Solution 2 can be regarded as a reduced version of Solution 1 and cannot handle cases such as:

    "lmhlo.constant"(%272) {value = dense<0.000000e+00> : tensor<64x784x768xf32>} : (memref<64x784x768xf32, "gpu">) -> ()
    "lmhlo.add"(%272, %273, %275)  {disc.device = "gpu"} : (memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">) -> ()
    "lmhlo.fusion"() ( {
          "lmhlo.multiply"(%1741, %275, %1742) {disc.device = "gpu"} : (memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">) -> ()
          "lmhlo.terminator"() : () -> ()
     }) {disc.device = "gpu", = "main_kLoop_reshape__37_1_2" : () -> ()
    "lmhlo.fusion"() ( {
          "lmhlo.multiply"(%1888, %275, %1889) {disc.device = "gpu"} : (memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">) -> ()
          "lmhlo.terminator"() : () -> ()
     }) {disc.device = "gpu", = "main_kLoop_reshape__37_1_3" : () -> ()
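
Solution 2 can be pictured as a small post-pass over the list of fusions. The following is only a toy sketch of that idea in Python, not BladeDISC's pass: the `Fusion` class and `fuse_splat_consts` function are hypothetical, buffers are modeled as plain strings, and consumers outside of fusions are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Fusion:
    """Toy model of a fusion kernel (hypothetical, for illustration only)."""
    name: str
    operands: List[str]  # buffers read by the ops inside the fusion
    splat_consts: Dict[str, float] = field(default_factory=dict)


def fuse_splat_consts(splats: Dict[str, float],
                      fusions: List[Fusion]) -> Dict[str, float]:
    """Clone each splat constant into every fusion that reads it.

    Returns the splat constants that no fusion consumes and that would
    therefore still be materialized as full buffers.
    """
    used = set()
    for fusion in fusions:
        for buf in fusion.operands:
            if buf in splats:
                # Duplicate the scalar value into the kernel, so the kernel
                # no longer needs to read the constant buffer from memory.
                fusion.splat_consts[buf] = splats[buf]
                used.add(buf)
    return {buf: v for buf, v in splats.items() if buf not in used}


# Mirrors the first IR snippet: one splat constant, two consuming fusions.
splats = {"%275": 0.0}
fusions = [
    Fusion("main_kLoop_reshape__37_1_2", ["%1741", "%275"]),
    Fusion("main_kLoop_reshape__37_1_3", ["%1888", "%275"]),
]
remaining = fuse_splat_consts(splats, fusions)
```

In the second IR snippet the multiplies consume %275, which is produced by lmhlo.add rather than by a constant, so a pass of this shape would find nothing to duplicate. That is the limitation described above.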
    opened by linearhit 10
  • build tao_compiler_main takes so long

    I tried to build TensorFlow BladeDISC using the command bash /path/to/BladeDISC/scripts/ci/, and the build takes a very long time. If I want to add or delete some code in the project, how can I make the build faster?

    2022-07-25 07:08:06: configure_bridge_bazel - 2.38 minutes
    2022-07-25 07:08:07: configure - 2.39 minutes
    2022-07-25 07:13:00: build_tao_bridge - 4.88 minutes
    2022-07-25 07:13:00: build_blade_gemm - 0.00 minutes
    2022-07-25 08:23:14: bazel_build(//tensorflow/compiler/decoupling:tao_compiler_main) - 70.23 minutes
    2022-07-25 08:25:17: bazel_build(//tensorflow/compiler/mlir/disc:disc-opt) - 2.05 minutes
    2022-07-25 08:28:30: bazel_build(//tensorflow/compiler/mlir/disc/tools/disc-replay:disc-replay-main) - 3.23 minutes
    2022-07-25 08:28:31: build_tao_compiler - 75.50 minutes
    2022-07-25 08:28:31: build_blade_gemm - 0.00 minutes
    2022-07-25 09:33:28: build_mlir_ral - 64.95 minutes
    opened by LucQueen 9
  • [Example][BERT] No performance improvement on V100 comparing with XLA

    Problem Statement

    I tried BladeDISC's BERT example on V100. Surprisingly, the result shows that BladeDISC has no performance improvement over XLA. Is this expected, or is something wrong?


    • GPU: V100
    • BladeDISC: latest-runtime-tensorflow1.15
    • Benchmark code:
      I made some minor changes to the example code, simply testing every batch size multiple times. This is critical for XLA, because it has an incredibly long compilation time during the first few steps. Running every batch size multiple times helps us identify those warm-up steps. [image]


    1. XLA: [image]
    2. DISC: [image]


    Excluding those incredibly long XLA warm-up steps (when it is compiling), I can hardly find any batch size on which DISC surpasses XLA.

    opened by zhuwenxi 8
  • disable some static checks of mhlo.dot_general

    Recently the mhlo community introduced more static checks for the mhlo dot_general op. Without that change, the following IR is valid:

    mhlo.dot_general(%0, %1) : (tensor<?x?x?xf32>, tensor<4x?x?xf32>)

    After the change, it is invalid because the static batch dimension sizes do not match.

    However, tf.BatchMatmul(tensor<?x?x?xf32>, tensor<4x?x?xf32>) is still valid and the tf.BatchMatmul tf2mhlo converter does not handle shape propagation between the lhs & rhs, leading to the failure of the above static check.

    This CL just disables the check in tf_community as a workaround.
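
The difference between the stricter check and a dynamic-shape-friendly one can be sketched as follows. This is an illustration of the idea only, not the verifier code in mlir-hlo: both function names are hypothetical, and `None` stands for a dynamic dimension (`?`).

```python
from typing import Optional, Sequence

Dim = Optional[int]  # None models a dynamic dimension ("?")


def strict_batch_dims_match(lhs: Sequence[Dim], rhs: Sequence[Dim],
                            batch_dims: Sequence[int]) -> bool:
    """Reject unless the static shape information is identical."""
    return all(lhs[i] == rhs[i] for i in batch_dims)


def batch_dims_compatible(lhs: Sequence[Dim], rhs: Sequence[Dim],
                          batch_dims: Sequence[int]) -> bool:
    """Reject only when two statically known sizes disagree."""
    for i in batch_dims:
        if lhs[i] is not None and rhs[i] is not None and lhs[i] != rhs[i]:
            return False
    return True


# tensor<?x?x?xf32> and tensor<4x?x?xf32>, with dimension 0 as the batch dimension
lhs = (None, None, None)
rhs = (4, None, None)

strict_ok = strict_batch_dims_match(lhs, rhs, batch_dims=[0])
lenient_ok = batch_dims_compatible(lhs, rhs, batch_dims=[0])
```

Under these assumptions the strict variant rejects the operand pair from the IR above, while the lenient variant accepts it and defers the size check to run time.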

    opened by wyzero 6
  • [Debug]How to verify compiled cluster results against original tf subgraph?

    Hi there! I am trying to use BladeDISC to optimize my TensorFlow model. I have successfully compiled my model with DISC; however, when checking the optimized model's output against the original TF version, I found a big difference. So I think there might be something wrong with some of the DISC-compiled clusters, which I am trying to track down. After spending some time diving into the code, I found a flag called tao_launch_enable_check. According to the code comments, when this flag is ON, it should check the compiled cluster result against the original TF subgraph. But I could not find the implementation for this flag, so I am wondering whether this logic has been implemented in the codebase. If not, could you please give some advice on implementing it? I would be glad to contribute this feature if it is within my ability. Thanks a lot! @linearhit
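
Independently of how tao_launch_enable_check is meant to work, the comparison itself can be sketched in a few lines. The following is a hypothetical helper, not BladeDISC code: it assumes the reference and compiled outputs of each cluster have already been collected as flat lists of floats, and the tolerances and cluster names are made up for illustration.

```python
import math
from typing import Iterable, Optional, Sequence, Tuple


def outputs_match(reference: Sequence[float], candidate: Sequence[float],
                  rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    """Element-wise comparison with relative and absolute tolerances."""
    if len(reference) != len(candidate):
        return False
    return all(math.isclose(r, c, rel_tol=rtol, abs_tol=atol)
               for r, c in zip(reference, candidate))


def first_mismatch(
    clusters: Iterable[Tuple[str, Sequence[float], Sequence[float]]]
) -> Optional[str]:
    """Return the name of the first cluster whose outputs differ, if any."""
    for name, reference, compiled in clusters:
        if not outputs_match(reference, compiled):
            return name
    return None


# Hypothetical per-cluster results: (name, original TF output, compiled output)
results = [
    ("cluster_0", [1.0, 2.0], [1.0, 2.0000001]),
    ("cluster_1", [0.5, 0.25], [0.5, 0.75]),
]
bad = first_mismatch(results)
```

Checking clusters in execution order narrows the search to the first cluster that diverges, since later differences may only be propagated errors.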

    opened by EddieBurning 6
  • Support gemm pre-packing via onednn/acl on aarch64

    ACL NEGEMM itself supports weight pre-packing by default. Without this PR, we create a new dnnl::matmul primitive for each matmul call, which in turn creates a new NEGEMM object inside the primitive on aarch64. We cannot reuse the pre-packed weight from a previous matmul call, even for the same matmul configuration, because the NEGEMM object is destroyed after each call. This PR caches the matmul primitive across matmul calls so that compatible calls can make use of the underlying pre-packed weight.
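
The caching idea can be sketched independently of oneDNN and ACL. In the following toy Python model, `MatmulPrimitive` is a hypothetical stand-in for an expensive primitive that packs its weight on first use, and the cache key (shape plus a weight identifier) is an assumption for illustration, not the key used in the PR.

```python
from typing import Dict, Tuple


class MatmulPrimitive:
    """Stand-in for a primitive that holds a pre-packed weight (hypothetical)."""

    created = 0  # number of primitives constructed, to make cache hits visible

    def __init__(self, m: int, n: int, k: int):
        MatmulPrimitive.created += 1
        self.shape = (m, n, k)
        self.weight_packed = False

    def run(self) -> str:
        # The weight is packed only on the first call of this primitive.
        if not self.weight_packed:
            self.weight_packed = True
            return "packed"
        return "reused"


_cache: Dict[Tuple[int, int, int, int], MatmulPrimitive] = {}


def get_primitive(m: int, n: int, k: int, weight_id: int) -> MatmulPrimitive:
    """Return a cached primitive for this configuration, creating it once."""
    key = (m, n, k, weight_id)
    prim = _cache.get(key)
    if prim is None:
        prim = _cache[key] = MatmulPrimitive(m, n, k)
    return prim


first = get_primitive(304, 256, 256, weight_id=1).run()
second = get_primitive(304, 256, 256, weight_id=1).run()
```

Because the second call hits the cache, the primitive and its packed weight survive between calls instead of being rebuilt each time.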

    opened by wyzero 5
  • Kernel Fusion of BladeDISC

    I would like to ask whether there is a way to get the kernel fusion results of BladeDISC (for a model such as BERT). I tried Nsight Systems, but the kernel names are not very intuitive, and it is difficult to deduce which operations a kernel includes. I also tried the "TORCH_BLADE_DEBUG_LOG" option of DISC, but the resulting IR does not seem to show kernel-level information.

    opened by LiZerun 4
  • torch-blade's `//src:torch_blade_include` may lead to illegal cache behaviour.

    By depending on the above, Bazel targets may have the C++ source root as their include root rather than the Bazel workspace root. As a consequence, targets may reference any header file without explicitly adding it to deps. This may cause problems when a remote build cache is introduced. Bazel does not seem to treat a change to any header file under . as a change to the target torch_blade_include.

    It's better to have explicit header dependency.

    CI TorchBlade 
    opened by qiuxiafei 4
  • [WIP] [to split into smaller PRs] A large set of GPU codegen optimization.

    1. Loop unroll and interleave.
    2. Row-reduce ops interleave in kStitch fusion.
    3. Schedule selection optimization adapts to target GPU architecture.
    4. Flags to enable CUDA fast-math.
    5. Other detailed refinements of the codegen schedule for kStitch fusion.
    opened by JamesTheZ 4
  • [transform] add a two-level tiling schedule and related UTs

    Following is some preliminary data:

    test on g6r; single thread; A, B and C are fully dynamic (pre-packing is not possible in such case)
      | m, n, k | DISC + transform (ms) | DISC + ACL (ms)
      | -- | -- | --
      | 304, 256, 256 | 1.02 | 1.00
      | 304, 512, 256 | 2.00 | 2.02
      | 304, 1024, 256 | 4.10 | 4.00
      | 304, 1024, 512 | 8.56 | 7.99
      | 1024, 1024, 1024 | 60.0 | 52.8
      | 34, 512, 256 | 0.301 | 0.293
      | 74, 512, 256 | 0.561 | 0.544
      | 174, 512, 256 | 1.19 | 1.207
      | 34, 256, 256 | 0.135 | 0.158
      | 74, 256, 256 | 0.272 | 0.281
      | 174, 256, 256 | 0.592 | 0.589

    to #787
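
For readers unfamiliar with the term, a two-level tiling schedule splits the matmul loops twice: outer tiles sized for the cache and inner micro-tiles sized for registers. The following pure-Python sketch illustrates the loop structure only. It is not the schedule added by this PR; the tile sizes are arbitrary, and boundary tiles are clamped with `min` so that shapes need not be multiples of the tile sizes.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul_two_level(a: Matrix, b: Matrix,
                     outer: Tuple[int, int, int] = (4, 4, 4),
                     inner: Tuple[int, int] = (2, 2)) -> Matrix:
    """Compute C = A x B with two nested levels of tiling."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    tm, tn, tk = outer  # level-1 (cache-sized) tile sizes
    um, un = inner      # level-2 (register-sized) micro-tile sizes
    for m0 in range(0, m, tm):
        for n0 in range(0, n, tn):
            for k0 in range(0, k, tk):
                m1, n1, k1 = min(m0 + tm, m), min(n0 + tn, n), min(k0 + tk, k)
                # Level 2: micro-tiles inside the current outer tile.
                for mi in range(m0, m1, um):
                    for ni in range(n0, n1, un):
                        for i in range(mi, min(mi + um, m1)):
                            for j in range(ni, min(ni + un, n1)):
                                acc = c[i][j]
                                for p in range(k0, k1):
                                    acc += a[i][p] * b[p][j]
                                c[i][j] = acc
    return c


def matmul_reference(a: Matrix, b: Matrix) -> Matrix:
    """Straightforward triple-loop result, used for comparison."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Shapes that are deliberately not multiples of the tile sizes.
a = [[float(i + j) for j in range(7)] for i in range(5)]
b = [[float((i * j) % 5) for j in range(3)] for i in range(7)]
tiled = matmul_two_level(a, b)
```

In a real code generator the innermost loops would be unrolled and vectorized; the sketch keeps them as plain loops for readability.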

    opened by wyzero 0
  • [TorchBench] Performance Signal Detected

    TorchBench CI has detected a performance signal.

    Affected Tests:

    • eval-cuda-fp32:
      • attention_is_all_you_need_pytorch[disc (latency)] 5.622->6.551, -16.5244%
      • attention_is_all_you_need_pytorch[dynamo-disc (latency)] 5.626->6.48, -15.1795%
      • hf_Bert[disc (latency)] 10.123->11.481, -13.415%
      • hf_Bert[dynamo-disc (latency)] 10.504->11.314, -7.7113%
      • nvidia_deeprecommender[blade (latency)] 5.567->5.736, -3.0357%
      • resnet18[blade (latency)] 2.219->2.347, -5.7684%
      • resnet18[dynamo-blade (latency)] 2.226->2.363, -6.1545%

    Detailed data can be found in oss://bladedisc-ci/TorchBench/partial/20221127-13, created automatically by TorchBench CI.
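
The percentages in the list above are consistent with the relative change (old - new) / old, so a negative value means the latency increased. A short sketch that reproduces a few of them (the function name is illustrative):

```python
def latency_change_percent(old: float, new: float) -> float:
    """Relative change in percent; negative means the new latency is higher."""
    return (old - new) / old * 100.0


# (test name, old latency, new latency) copied from the report above
signals = [
    ("attention_is_all_you_need_pytorch[disc]", 5.622, 6.551),
    ("hf_Bert[disc]", 10.123, 11.481),
    ("resnet18[blade]", 2.219, 2.347),
]

changes = {name: latency_change_percent(old, new) for name, old, new in signals}

for name, pct in changes.items():
    print(f"{name}: {pct:.4f}%")
```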

    opened by zzpmiracle 1
  • New fusion decision logic for GEMM fusion and codegen.

    The fusion decision for kDot in the previous version is not robust. It relies on horizontal fusion to merge dot with other kLoop fusions, which fails easily given the checking logic in the initialization phase of the fusion pattern.

    This PR provides a robust fusion decision for kDot fusion that no longer relies on horizontal fusion. It first applies a pre-fusion phase to the element-wise ops supported by kDot fusion, and then fuses kDot with those element-wise groups. It also allows pruning the fusion decision so that it meets the codegen constraints.

    opened by JamesTheZ 0
  • v0.2.0(May 11, 2022)

    Release 0.2.0

    Performance Optimization

    GPU stitch fusion

    Make use of GPU shared memory to fuse a reduce operator with its consumers into one kernel. This helps to fit complex memory-intensive computations (e.g., LayerNorm, SoftMax) into one kernel, reducing off-chip memory traffic and the overhead of kernel scheduling and launching. It implements part of the functionality described in the AStitch paper. It is currently being refactored to improve robustness, so it is not enabled by default. Users of BladeDISC can enable it by setting the environment variable DISC_ENABLE_STITCH=true.
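
One way to set the variable from a Python program is shown below. This is a sketch under the assumption that the variable is read when compilation happens, so it is set at the very top of the script before any BladeDISC import; exporting it in the shell before launching the program should work equally well.

```python
import os

# Assumption: set this before BladeDISC is imported or triggers compilation.
os.environ["DISC_ENABLE_STITCH"] = "true"

# ... import tensorflow_blade_disc or torch_blade and run the model as usual
```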

    Note that we already released the CPU stitch optimization when we open-sourced the BladeDISC project, and it is enabled by default. Refer to the materials for more information about the details of the CPU stitch technique.

    GEMM merging

    Support two types of GEMM merging optimization. One is to merge two GEMMs sharing the same operand into a single GEMM. The other one is to merge two GEMMs with the same shape into a batched GEMM. The GEMM merging optimization helps to increase hardware utilization and to reduce kernel launch overhead.
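
The first kind of merging rests on a simple identity: if two GEMMs share the left operand A, then A x B1 and A x B2 are the column blocks of A x [B1 | B2]. A minimal pure-Python sketch of that identity follows (the helper names are illustrative and this is not BladeDISC's implementation):

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, used as the building block."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merged_shared_lhs(a: Matrix, b1: Matrix, b2: Matrix) -> Tuple[Matrix, Matrix]:
    """Compute A x B1 and A x B2 with a single, wider GEMM."""
    # Concatenate the right-hand operands along the column dimension.
    b = [r1 + r2 for r1, r2 in zip(b1, b2)]
    c = matmul(a, b)
    n1 = len(b1[0])
    # Split the result columns back into the two original outputs.
    return [row[:n1] for row in c], [row[n1:] for row in c]


a = [[1.0, 2.0], [3.0, 4.0]]
b1 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [[2.0], [3.0]]
c1, c2 = merged_shared_lhs(a, b1, b2)
```

The second kind of merging, stacking same-shaped GEMMs into a batched GEMM, follows the same pattern with stacking along a new batch dimension instead of concatenation.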

    CPU GEMM/Convolution weight pre-packing optimization

    Support weight pre-packing optimization for convolution (calling onednn library) and GEMM (calling mkl/onednn/acl libraries) operations.

    Convolution layout optimization and transpose elimination

    Support transforming the layout of the convolution operator to the friendliest format for the specific device (i.e., either CPU or GPU). Most of the introduced transpose operators can be eliminated by a subsequent transpose-simplifier pass.

    Other optimizations

    • Optimize the schedule selection strategy for reduce operator on GPU to enhance thread-level-parallelism.
    • Algebraic simplification for operators like power.
    • Support fusing a splat constant operator with its consumers, reducing memory access overhead. Refer to issue.

    Function Enhancement

    CPU end-to-end optimization

    Support end-to-end optimization for X86 and AArch64 CPUs.

    TorchBlade/TensorFlowBlade clustering and optimizing with TensorRT

    According to the supported operators of TensorRT, cluster sub-graphs and apply TensorRT optimization for both TensorFlow and PyTorch models.

    Accelerating PyTorch Training

    Release a PoC version for accelerating PyTorch training via Disc + Lazy Tensor Core. Refer to the related issue and design doc.

    Shape analysis and simplifier enhancement

    Enhance the shape equality analysis according to the dimension values. Add an analysis of the collapse and expand relationships between dimensions, which helps to identify the dimension mapping between the input and output values of a reshape operator. This is a basic building block for GPU stitch fusion.
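
To make the collapse/expand relationship concrete, the following sketch groups the input and output dimensions of a reshape whose size products match. It is a simplified illustration for fully static, positive dimension sizes only; the analysis described above also has to reason about dynamic dimensions, which this sketch does not attempt. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Group = Tuple[List[int], List[int]]  # (input dim indices, output dim indices)


def reshape_dim_groups(in_shape: Sequence[int],
                       out_shape: Sequence[int]) -> List[Group]:
    """Pair up runs of input and output dimensions with equal products."""
    groups: List[Group] = []
    i = j = 0
    while i < len(in_shape) and j < len(out_shape):
        in_dims, out_dims = [i], [j]
        pi, pj = in_shape[i], out_shape[j]
        i, j = i + 1, j + 1
        # Grow the side with the smaller product until both products agree.
        while pi != pj:
            if pi < pj:
                if i == len(in_shape):
                    raise ValueError("shapes have different element counts")
                pi *= in_shape[i]
                in_dims.append(i)
                i += 1
            else:
                if j == len(out_shape):
                    raise ValueError("shapes have different element counts")
                pj *= out_shape[j]
                out_dims.append(j)
                j += 1
        groups.append((in_dims, out_dims))
    # Leftover dimensions must have size 1; attach them to the last group.
    rest_in = list(range(i, len(in_shape)))
    rest_out = list(range(j, len(out_shape)))
    if any(in_shape[d] != 1 for d in rest_in) or \
            any(out_shape[d] != 1 for d in rest_out):
        raise ValueError("shapes have different element counts")
    if groups:
        groups[-1][0].extend(rest_in)
        groups[-1][1].extend(rest_out)
    elif rest_in or rest_out:
        groups.append((rest_in, rest_out))
    return groups


collapse = reshape_dim_groups((64, 784, 768), (50176, 768))
expand = reshape_dim_groups((50176, 768), (64, 784, 768))
```

Here `collapse` records that input dimensions 0 and 1 are collapsed into output dimension 0 while dimension 2 maps to output dimension 1, and `expand` records the inverse mapping.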

    Codegen support for int8 datatype

    Support int8 datatype for the code generation of memory-intensive operators (e.g., element-wise, reduce operators).

    Toolchain Support and Process Optimization

    Replay tool

    Support dumping clusters and the corresponding input data, based on which developers can replay the execution. This is effective for debugging and tuning. Refer to issue.

    CI optimization

    Enhance the CI process of the BladeDISC repo, which helps people from the community contribute to BladeDISC more conveniently and efficiently.

    TorchBlade bazel build

    Migrate TorchBlade's compilation toolchain from the original CMake to bazel, enhancing maintainability.


    Example preparation

    Prepare a set of commonly used models as the examples for BladeDISC. Compare the performance of BladeDISC with TensorRT, XLA and ONNX Runtime (ORT) upon the examples.

    Community TF rebase

    Rebase the TensorFlow codebase used by BladeDISC onto the newest community code.

    Code maintenance

    Continuous bug fixing and code refactoring.

Alibaba Open Source