Machine Learning Platform for Kubernetes

Overview




Reproduce, Automate, Scale your data science.


Welcome to Polyaxon, a platform for building, training, and monitoring large-scale deep learning applications. It is a system designed to solve reproducibility, automation, and scalability for machine learning applications.

Polyaxon deploys into any data center or cloud provider, can alternatively be hosted and managed by Polyaxon, and supports all the major deep learning frameworks such as TensorFlow, MXNet, Caffe, and Torch.

Polyaxon makes it faster, easier, and more efficient to develop deep learning applications by managing workloads with smart container and node scheduling, and it turns GPU servers into shared, self-service resources for your team or organization.




Install

TL;DR

  • Install CLI

    # Install Polyaxon CLI
    $ pip install -U polyaxon
  • Create a deployment

    # Create a namespace
    $ kubectl create namespace polyaxon
    
    # Add Polyaxon charts repo
    $ helm repo add polyaxon https://charts.polyaxon.com
    
    # Deploy Polyaxon
    $ polyaxon admin deploy -f config.yaml
    
    # Access API
    $ polyaxon port-forward

Please check the Polyaxon installation guide.
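
The deploy step above reads a deployment config file. Here is a minimal sketch of a config.yaml, with keys mirrored from a deployment config quoted in one of the issue reports further down this page; the chart type and version are assumptions to adjust for your cluster:

    # config.yaml -- minimal sketch; adjust for your cluster
    deploymentChart: platform
    deploymentVersion: 1.12.2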

Quick start

TL;DR

  • Start a project

    # Create a project
    $ polyaxon project create --name=quick-start --description='Polyaxon quick start.'
  • Train and track logs & resources

    # Upload code and start experiments
    $ polyaxon run -f experiment.yaml -l
  • Dashboard

    # Start Polyaxon dashboard
    $ polyaxon dashboard
    
    Dashboard page will now open in your browser. Continue? [Y/n]: y
  • Notebook

    # Start Jupyter notebook for your project
    $ polyaxon run --hub notebook
  • Tensorboard

    # Start TensorBoard for a run's output
    $ polyaxon run --hub tensorboard --run-uuid=UUID



Please check our quick start guide to start training your first experiment.
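
For reference, a minimal sketch of what the experiment.yaml used above can look like, adapted from the component spec quoted in one of the issue reports below; the name and the training command are assumptions:

    # experiment.yaml -- minimal sketch
    version: 1.1
    kind: component
    name: quick-start
    run:
      kind: job
      init:
        - git: {url: "https://github.com/polyaxon/polyaxon-quick-start"}
      container:
        image: polyaxon/polyaxon-quick-start
        workingDir: "{{ globals.artifacts_path }}/polyaxon-quick-start"
        command: [python3, model.py]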

Distributed job

Polyaxon supports and simplifies distributed jobs. Depending on the framework you are using, you need to deploy the corresponding operator, adapt your code to enable distributed training, and update your polyaxonfile.

Below is a sketch of what a distributed training polyaxonfile can look like:
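
This is a hedged sketch of a distributed PyTorchJob polyaxonfile, with the structure mirrored from a pytorchjob example quoted in one of the issue reports below; it assumes the Kubeflow PyTorch operator is deployed, and the training script is a placeholder:

    version: 1.1
    kind: component
    run:
      kind: pytorchjob
      master:
        replicas: 1
        container:
          image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
          command: ["python", "-u", "train.py"]  # placeholder script
      worker:
        replicas: 2
        container:
          image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
          command: ["python", "-u", "train.py"]  # placeholder script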

Hyperparameters tuning

Polyaxon has a concept, very similar to Google Vizier, for suggesting hyperparameters and managing their results, called experiment groups. An experiment group in Polyaxon defines a search algorithm, a search space, and a model to train, as sketched below.
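
A minimal sketch of such a search definition, assuming the v1 matrix API; the algorithm here is a grid over one choice parameter, and the component reference is hypothetical:

    version: 1.1
    kind: operation
    matrix:
      kind: grid                       # search algorithm
      concurrency: 2
      params:
        learning_rate:                 # search space
          kind: choice
          value: [0.001, 0.01, 0.1]
    hubRef: my-training-component      # hypothetical component reference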

Parallel executions

You can run your processing or model-training jobs in parallel; Polyaxon provides a mapping abstraction to manage concurrent jobs, sketched below.
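
A minimal sketch of the mapping abstraction, assuming the v1 matrix API; the parameter values and the component reference are hypothetical:

    version: 1.1
    kind: operation
    matrix:
      kind: mapping
      concurrency: 4                   # how many jobs run at once
      values:
        - {partition: "2020-01"}
        - {partition: "2020-02"}
        - {partition: "2020-03"}
    pathRef: ./process.yaml            # hypothetical component reference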

DAGs and workflows

Polyaxon DAGs is a tool that provides a container-native engine for running machine learning pipelines. A DAG manages multiple operations with dependencies. Each operation is defined by a component runtime, which means that operations in a DAG can be jobs, services, distributed jobs, parallel executions, or nested DAGs. A minimal sketch follows.
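
A minimal sketch of a DAG runtime, assuming the v1 API; operation names and component references are hypothetical:

    version: 1.1
    kind: component
    run:
      kind: dag
      operations:
        - name: preprocess
          hubRef: my-preprocess-component  # hypothetical
        - name: train
          dependencies: [preprocess]       # runs after preprocess succeeds
          hubRef: my-training-component    # hypothetical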

Architecture

[Polyaxon architecture diagram]

Documentation

Check out our documentation to learn more about Polyaxon.

Dashboard

Polyaxon comes with a dashboard that shows the projects and experiments created by you and your team members.

To start the dashboard, run the following command in your terminal:

$ polyaxon dashboard -y

Project status

Polyaxon is stable and runs in production at many startups and Fortune 500 companies.

Contributions

Please follow the contribution guidelines: Contribute to Polyaxon.

Research

If you use Polyaxon in your academic research, we would be grateful if you could cite it.

Feel free to contact us; we would love to learn about your project and see how we can support your custom needs.

Issues
  • Tensorboard error for the quick-start example


    Describe the bug

    I'm running the examples from the quick-start guide, and when I tried to start TensorBoard I got this error:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 316, in create_or_update_deployment
        return self.create_deployment(name=name, body=body), True
      File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 302, in create_deployment
        namespace=self.namespace, body=body
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 175, in create_namespaced_deployment
        (data) = self.create_namespaced_deployment_with_http_info(namespace, body, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 266, in create_namespaced_deployment_with_http_info
        collection_formats=collection_formats)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
        _return_http_data_only, collection_formats, _preload_content, _request_timeout)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
        _request_timeout=_request_timeout)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
        body=body)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST
        body=body)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request
        raise ApiException(http_resp=r)
    kubernetes.client.rest.ApiException: (403)
    Reason: Forbidden
    HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 21 Jan 2020 17:03:28 GMT', 'Content-Length': '374'})
    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.extensions is forbidden: User \"system:serviceaccount:polyaxon:polyaxon-polyaxon-serviceaccount\" cannot create resource \"deployments\" in API group \"extensions\" in the namespace \"polyaxon\"","reason":"Forbidden","details":{"group":"extensions","kind":"deployments"},"code":403}

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 319, in create_or_update_deployment
        return self.update_deployment(name=name, body=body), False
      File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 309, in update_deployment
        name=name, namespace=self.namespace, body=body
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 4089, in patch_namespaced_deployment
        (data) = self.patch_namespaced_deployment_with_http_info(name, namespace, body, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 4189, in patch_namespaced_deployment_with_http_info
        collection_formats=collection_formats)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
        _return_http_data_only, collection_formats, _preload_content, _request_timeout)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
        _request_timeout=_request_timeout)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 393, in request
        body=body)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 286, in PATCH
        body=body)
      File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request
        raise ApiException(http_resp=r)
    kubernetes.client.rest.ApiException: (403)
    Reason: Forbidden
    HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 21 Jan 2020 17:03:28 GMT', 'Content-Length': '484'})
    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.extensions \"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4\" is forbidden: User \"system:serviceaccount:polyaxon:polyaxon-polyaxon-serviceaccount\" cannot patch resource \"deployments\" in API group \"extensions\" in the namespace \"polyaxon\"","reason":"Forbidden","details":{"name":"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4","group":"extensions","kind":"deployments"},"code":403}

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/polyaxon/polyaxon/scheduler/tensorboard_scheduler.py", line 53, in start_tensorboard
        reconcile_url=get_tensorboard_reconcile_url(tensorboard.unique_name))
      File "/polyaxon/polyaxon/polypod/tensorboard.py", line 234, in start_tensorboard
        reraise=True)
      File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 322, in create_or_update_deployment
        raise PolyaxonK8SError(e)
    polyaxon_k8s.exceptions.PolyaxonK8SError: (403)
    Reason: Forbidden
    HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 21 Jan 2020 17:03:28 GMT', 'Content-Length': '484'})
    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.extensions \"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4\" is forbidden: User \"system:serviceaccount:polyaxon:polyaxon-polyaxon-serviceaccount\" cannot patch resource \"deployments\" in API group \"extensions\" in the namespace \"polyaxon\"","reason":"Forbidden","details":{"name":"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4","group":"extensions","kind":"deployments"},"code":403}
    

    To Reproduce

    $ git clone https://github.com/polyaxon/polyaxon-quick-start.git
    $ # run create, init, etc.
    $ polyaxon run -f polyaxonfile_hyperparams.yml
    $ # wait..
    $ polyaxon tensorboard -g 1 start
    

    Expected behavior

    No error.

    Environment

    Kubernetes 1.17 using Kubeadm on a local cluster.

    Let me know if you need more info.

    bug area/helm-charts 
    opened by vakker 24
  • Expose configmaps/secrets to build environment


    Hey, I was wondering if I could expose configmaps or secrets to build jobs as well. What I'm trying to do is add some custom apt sources, along with a client cert, in order to install some internal packages as dependencies. Currently we work around this by installing some packages at runtime. A sketch of the idea follows.
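
    A hedged illustration of the ask, reusing the build-section syntax that appears in another issue on this page; the paths and package names are hypothetical, and mounting the configmap/secret into the build pod is exactly the missing piece:

    build:
      image: tensorflow/tensorflow:1.12.0
      build_steps:
        # hypothetical: assumes the sources list and client cert were mounted into the build pod
        - cp /mounted-secret/client.crt /usr/local/share/ca-certificates/ && update-ca-certificates
        - cp /mounted-config/internal.list /etc/apt/sources.list.d/
        - apt-get update && apt-get install -y some-internal-package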

    opened by Mofef 22
  • No nodes in cluster and experiments fail to build


    I deployed Polyaxon on Minikube (Mac) and am trying to run experiments using the polyaxon quickstart repo (https://github.com/polyaxon/polyaxon-quick-start.git). However, the experiment build keeps failing, and running 'polyaxon cluster' shows no nodes:

    Cluster info:


    major: 1
    minor: 10
    compiler: gc
    platform: linux/amd64
    build_date: 2018-03-26T16:44:10Z
    git_commit: fc32d2f3698e36b93322a3465f63a14e9f0eaead
    go_version: go1.9.3
    git_version: v1.10.0
    git_tree_state: clean


    When I run 'kubectl get pods --all-namespaces', this is the output

    NAMESPACE     NAME                                            READY   STATUS    RESTARTS   AGE
    kube-system   coredns-c4cffd6dc-42gcs                         1/1     Running   0          23h
    kube-system   etcd-minikube                                   1/1     Running   0          23h
    kube-system   kube-addon-manager-minikube                     1/1     Running   0          23h
    kube-system   kube-apiserver-minikube                         1/1     Running   0          23h
    kube-system   kube-controller-manager-minikube                1/1     Running   0          23h
    kube-system   kube-dns-86f4d74b45-652fq                       3/3     Running   0          23h
    kube-system   kube-proxy-npxr5                                1/1     Running   0          23h
    kube-system   kube-scheduler-minikube                         1/1     Running   0          23h
    kube-system   kubernetes-dashboard-6f4cfc5d87-p2z4j           1/1     Running   0          23h
    kube-system   storage-provisioner                             1/1     Running   0          23h
    kube-system   tiller-deploy-778f674bf5-xhmsv                  1/1     Running   0          23h
    polyaxon      polyaxon-docker-registry-78d5499fc9-4wm69       1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-api-7b97bb447d-jl6h6          2/2     Running   0          5h
    polyaxon      polyaxon-polyaxon-beat-77fb6cccc7-lmdhw         2/2     Running   0          5h
    polyaxon      polyaxon-polyaxon-events-79c8ff59d9-2rqcq       1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-hpsearch-9b5589f5-874n5       1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-k8s-events-697cf8bb65-mnjz8   1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-logs-7bf467999-b8755          1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-monitors-57db4f7cd7-7x2j5     2/2     Running   0          5h
    polyaxon      polyaxon-polyaxon-resources-glgwq               1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-scheduler-76ccf9d665-xb9bg    1/1     Running   0          5h
    polyaxon      polyaxon-postgresql-78d4cff55c-jhcvz            1/1     Running   0          5h
    polyaxon      polyaxon-rabbitmq-6448d76c84-vp5ll              1/1     Running   0          5h
    polyaxon      polyaxon-redis-688468649b-tg6qp                 1/1     Running   0          5h

    I have also tried running 'helm update' and upgraded polyaxon to the latest release (0.3.2). How can I troubleshoot this?

    opened by jonathanlimsc 21
  • deleted flagged missed in initialization


    Describe the bug

    Getting this error with version 1.1.9

    [error screenshot not transcribed]

    To reproduce

    polyaxon upgrade && polyaxon run -f polyaxonfile

    Expected behavior

    Run completed

    Environment

    polyaxon 1.1.9

    question not-reproducible 
    opened by zeyaddeeb 20
  • Scheduling many jobs at the same time leads to zombie state jobs (possible race condition?)


    Describe the bug

    It's hard to reproduce consistently, but when scheduling many jobs whose builds happen at the same time, we seem to get the following scenario: K8s correctly schedules the pods according to their requests/limits and the available resources. Polyaxon, however, believes that some jobs are running even though they are unschedulable by K8s. When resources are freed up quickly enough, K8s actually schedules those jobs and nothing else happens. However, if resources are blocked long enough, Polyaxon's heartbeat service will automatically stop these jobs (which it believes are running although they are unschedulable by K8s) and fail them. To me, this could be a critical bug in the scheduler and really seems like some kind of race condition. I haven't tested it with multiple users, but I assume this would also occur if many users submit different jobs at the same time (a likely scenario).

    To Reproduce

    1. Create a job with a fairly large build and a long running time (>2000 seconds).
    2. Make sure that only two of these jobs can run on the cluster at a time (by requesting resources accordingly).
    3. Run this job many times with polyaxon run -f polyaxonfile.yml (submit the command again as soon as it terminates, and repeat 5 times).

    Expected behavior

    The jobs should just be recognized as unschedulable and scheduled when the resources become available again.

    Environment

    Polyaxon 0.5.6, Kubernetes 1.15.4

    opened by MatthiasKohl 20
  • Can't use TPU


    Describe the bug

    I tried to use a Cloud TPU, but I got the following error in Stackdriver logging and the experiment failed. It seems that we need to specify the TensorFlow version with an annotation.

    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: admission webhook \"pod-init.cloud-tpus.google.com\" denied the request: TensorFlow version must be specified in annotation \"tf-version.cloud-tpus.google.com\" for pod requesting Cloud TPUs","reason":"InternalError","details":{"causes":[{"message":"admission webhook \"pod-init.cloud-tpus.google.com\" denied the request: TensorFlow version must be specified in annotation \"tf-version.cloud-tpus.google.com\" for pod requesting Cloud TPUs"}]},"code":500}
    

    To Reproduce

    YAML

    ---
    version: 1
    
    kind: experiment
    
    environment:
      resources:
        cpu:
          requests: 4
          limits: 4
        memory:
          requests: 15000
          limits: 15000
        tpu:
          requests: 8
          limits: 8
    
    build:
      image: tensorflow/tensorflow:1.12.0
      build_steps:
        - pip install --no-cache-dir -r requirements.txt
    
    run:
      # this is just a dummy python file.
      cmd: python test.py
    

    requirements.txt

    polyaxon-client==0.3.8
    polyaxon-cli==0.3.8
    jupyter
    google-cloud-storage
    

    Expected behavior

    The experiment can run using a Cloud TPU.

    Environment

    • Polyaxon: 0.3.8

    Links

    • https://cloud.google.com/tpu/docs/kubernetes-engine-setup
    • https://github.com/tensorflow/tpu/blob/master/models/official/resnet/resnet_k8s.yaml#L28
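
    For reference, the admission webhook named in the error expects a pod-level annotation like the following (this is what the resnet_k8s.yaml linked above sets); how to surface it through the Polyaxon experiment spec is the open question here:

    metadata:
      annotations:
        tf-version.cloud-tpus.google.com: "1.12"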
    bug 
    opened by yu-iskw 20
  • Deploying on Kubernetes cluster created w/ Kubespray


    Hi -

    I'm trying to spin up a Kubernetes cluster without the benefit of a managed service like EKS or GKE, and then deploy Polyaxon on that cluster. Currently I'm experiencing some issues on the Polyaxon side of this process.

    To deploy the Kubernetes cluster I'm using kubespray. I'm able to deploy the cluster to the point that kubectl get nodes shows the expected nodes in a ready state, and I'm able to deploy a simple Node.js app as a test. I am not, however, able to successfully install Polyaxon on the cluster.

    I've tried on both AWS and on my local machine using Vagrant/Virtualbox. The issues I'm experiencing are different between the two cases, which I find interesting, so I'll document both.

    AWS

    I deployed Kubernetes by loosely following this tutorial. Things went smoothly for the most part, except that I needed to deal with this issue using this fix. I used 3 t2.large instances running Ubuntu 16.04 and the standard kubespray configuration.

    As I mentioned above, I get the expected output from kubectl get nodes, and I'm able to deploy the Node.js app at the end of the tutorial.

    At first, the Polyaxon installation/deployment also seems to succeed:

    $ helm install polyaxon/polyaxon \
    > --name=polyaxon \
    > --namespace=polyaxon \
    > -f polyaxon_config.yml
    NAME:   polyaxon
    LAST DEPLOYED: Sat Feb  9 00:03:29 2019
    NAMESPACE: polyaxon
    STATUS: DEPLOYED
    
    RESOURCES:
    ==> v1/Secret
    NAME                             TYPE    DATA  AGE
    polyaxon-docker-registry-secret  Opaque  1     3m4s
    polyaxon-postgresql              Opaque  1     3m4s
    polyaxon-rabbitmq                Opaque  2     3m4s
    polyaxon-polyaxon-secret         Opaque  4     3m4s
    
    ==> v1/ConfigMap
    NAME                      DATA  AGE
    redis-config              1     3m4s
    polyaxon-polyaxon-config  141   3m4s
    
    ==> v1beta1/ClusterRole
    NAME                           AGE
    polyaxon-polyaxon-clusterrole  3m4s
    
    ==> v1beta1/DaemonSet
    NAME                         DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR  AGE
    polyaxon-polyaxon-resources  2        2        2      2           2          <none>         3m4s
    
    ==> v1beta1/Deployment
    NAME                          DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
    polyaxon-docker-registry      1        1        1           1          3m4s
    polyaxon-postgresql           1        1        1           1          3m4s
    polyaxon-rabbitmq             1        1        1           1          3m4s
    polyaxon-redis                1        1        1           1          3m4s
    polyaxon-polyaxon-api         1        1        1           0          3m4s
    polyaxon-polyaxon-beat        1        1        1           1          3m4s
    polyaxon-polyaxon-events      1        1        1           1          3m4s
    polyaxon-polyaxon-hpsearch    1        1        1           1          3m4s
    polyaxon-polyaxon-k8s-events  1        1        1           1          3m4s
    polyaxon-polyaxon-monitors    1        1        1           1          3m4s
    polyaxon-polyaxon-scheduler   1        1        1           1          3m3s
    
    ==> v1/Pod(related)
    NAME                                           READY  STATUS   RESTARTS  AGE
    polyaxon-polyaxon-resources-hpbcv              1/1    Running  0         3m4s
    polyaxon-polyaxon-resources-m7bjv              1/1    Running  0         3m4s
    polyaxon-docker-registry-58bff6f777-vkl6h      1/1    Running  0         3m4s
    polyaxon-postgresql-f4fc68c67-25t4p            1/1    Running  0         3m4s
    polyaxon-rabbitmq-74c5d87cf6-qlk2b             1/1    Running  0         3m4s
    polyaxon-redis-6f7db88668-99qvw                1/1    Running  0         3m4s
    polyaxon-polyaxon-api-75c5989cb4-ppv7t         1/2    Running  0         3m4s
    polyaxon-polyaxon-beat-759d6f9f96-qdhmd        2/2    Running  0         3m3s
    polyaxon-polyaxon-events-86f49f8b78-vvscx      1/1    Running  0         3m4s
    polyaxon-polyaxon-hpsearch-5f77c8d6cd-gkdms    1/1    Running  0         3m3s
    polyaxon-polyaxon-k8s-events-555f6c8754-c242k  1/1    Running  0         3m3s
    polyaxon-polyaxon-monitors-864dd8fb67-h7s47    2/2    Running  0         3m2s
    polyaxon-polyaxon-scheduler-7f4978774d-pm9xz   1/1    Running  0         3m2s
    
    ==> v1/ServiceAccount
    NAME                                      SECRETS  AGE
    polyaxon-polyaxon-serviceaccount          1        3m4s
    polyaxon-polyaxon-workers-serviceaccount  1        3m4s
    
    ==> v1beta1/ClusterRoleBinding
    NAME                                   AGE
    polyaxon-polyaxon-clusterrole-binding  3m4s
    
    ==> v1beta1/Role
    NAME                            AGE
    polyaxon-polyaxon-role          3m4s
    polyaxon-polyaxon-workers-role  3m4s
    
    ==> v1beta1/RoleBinding
    NAME                                    AGE
    polyaxon-polyaxon-role-binding          3m4s
    polyaxon-polyaxon-workers-role-binding  3m4s
    
    ==> v1/Service
    NAME                      TYPE          CLUSTER-IP     EXTERNAL-IP  PORT(S)                                AGE
    polyaxon-docker-registry  NodePort      10.233.42.186  <none>       5000:31813/TCP                         3m4s
    polyaxon-postgresql       ClusterIP     10.233.17.56   <none>       5432/TCP                               3m4s
    polyaxon-rabbitmq         ClusterIP     10.233.33.173  <none>       4369/TCP,5672/TCP,25672/TCP,15672/TCP  3m4s
    polyaxon-redis            ClusterIP     10.233.31.108  <none>       6379/TCP                               3m4s
    polyaxon-polyaxon-api     LoadBalancer  10.233.36.234  <pending>    80:32050/TCP,1337:31832/TCP            3m4s
    

    After a few minutes all the expected pods are running:

    $ kubectl get pods --namespace polyaxon
    NAME                                            READY   STATUS    RESTARTS   AGE
    polyaxon-docker-registry-58bff6f777-vkl6h       1/1     Running   0          3m49s
    polyaxon-polyaxon-api-75c5989cb4-ppv7t          1/2     Running   0          3m49s
    polyaxon-polyaxon-beat-759d6f9f96-qdhmd         2/2     Running   0          3m48s
    polyaxon-polyaxon-events-86f49f8b78-vvscx       1/1     Running   0          3m49s
    polyaxon-polyaxon-hpsearch-5f77c8d6cd-gkdms     1/1     Running   0          3m48s
    polyaxon-polyaxon-k8s-events-555f6c8754-c242k   1/1     Running   0          3m48s
    polyaxon-polyaxon-monitors-864dd8fb67-h7s47     2/2     Running   0          3m47s
    polyaxon-polyaxon-resources-hpbcv               1/1     Running   0          3m49s
    polyaxon-polyaxon-resources-m7bjv               1/1     Running   0          3m49s
    polyaxon-polyaxon-scheduler-7f4978774d-pm9xz    1/1     Running   0          3m47s
    polyaxon-postgresql-f4fc68c67-25t4p             1/1     Running   0          3m49s
    polyaxon-rabbitmq-74c5d87cf6-qlk2b              1/1     Running   0          3m49s
    polyaxon-redis-6f7db88668-99qvw                 1/1     Running   0          3m49s
    

    The issue in this case arises with the LoadBalancer IP, which remains stuck in a pending state:

    $ kubectl get --namespace polyaxon svc -w polyaxon-polyaxon-api
    NAME                    TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
    polyaxon-polyaxon-api   LoadBalancer   10.233.52.219   <pending>     80:30684/TCP,1337:31886/TCP   13h
    
    $ kubectl get svc --namespace polyaxon polyaxon-polyaxon-api -o json
    {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "creationTimestamp": "2019-02-09T01:03:11Z",
            "labels": {
                "app": "polyaxon-polyaxon-api",
                "chart": "polyaxon-0.3.8",
                "heritage": "Tiller",
                "release": "polyaxon",
                "role": "polyaxon-api",
                "type": "polyaxon-core"
            },
            "name": "polyaxon-polyaxon-api",
            "namespace": "polyaxon",
            "resourceVersion": "17172",
            "selfLink": "/api/v1/namespaces/polyaxon/services/polyaxon-polyaxon-api",
            "uid": "78640925-2c06-11e9-8f3f-121248b9afae"
        },
        "spec": {
            "clusterIP": "10.233.52.219",
            "externalTrafficPolicy": "Cluster",
            "ports": [
                {
                    "name": "api",
                    "nodePort": 30684,
                    "port": 80,
                    "protocol": "TCP",
                    "targetPort": 80
                },
                {
                    "name": "streams",
                    "nodePort": 31886,
                    "port": 1337,
                    "protocol": "TCP",
                    "targetPort": 1337
                }
            ],
            "selector": {
                "app": "polyaxon-polyaxon-api"
            },
            "sessionAffinity": "None",
            "type": "LoadBalancer"
        },
        "status": {
            "loadBalancer": {}
        }
    }
    

    Looking through the Polyaxon issues, I see that this can happen on minikube, but I wasn't able to find anything that helps me debug my particular case. What are the conditions that need to be met in the Kubernetes deployment, in order for the LoadBalancer IP step to succeed?
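
    For context: a LoadBalancer service only receives an external IP when something in the cluster can provision one (a cloud controller manager, or a bare-metal implementation such as MetalLB); otherwise it stays pending indefinitely. A hedged workaround sketch, assuming the chart exposes a service-type value (the key name is an assumption, not confirmed against the chart):

    # polyaxon_config.yml -- hypothetical override
    serviceType: NodePort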

    Vagrant/Virtualbox

    I was suspicious that my issues might be specific to the AWS environment, rather than a general issue with kubespray/polyaxon, so as a second test I tried deploying the Kubernetes cluster locally using Vagrant and Virtualbox. To do this I used the Vagrantfile in the kubespray repo as described here.

    After debugging a couple of kubespray issues, I was able to get the cluster up and running and deploy the Node.js app again.

    Deploying Polyaxon, I again saw the issue with the LoadBalancer IP getting stuck in a pending state. What was interesting to me, though, was that a number of pods actually failed to run as well, despite the fact that the deployment ostensibly succeeded:

    $ helm ls
    NAME            REVISION        UPDATED                         STATUS          CHART           APP VERSION     NAMESPACE
    polyaxon        1               Sat Feb  9 06:01:21 2019        DEPLOYED        polyaxon-0.3.8                  polyaxon
    
    $ kubectl get pods --namespace polyaxon
    NAME                                           READY   STATUS    RESTARTS   AGE
    polyaxon-docker-registry-58bff6f777-wlb9p      0/1     Pending   0          36m
    polyaxon-polyaxon-api-6bc75ff4ff-v694k         0/2     Pending   0          36m
    polyaxon-polyaxon-beat-744c96b9f8-mbz5j        0/2     Pending   0          36m
    polyaxon-polyaxon-events-58d9c9cbd6-72skt      0/1     Pending   0          36m
    polyaxon-polyaxon-hpsearch-dc9cf6556-8rh78     0/1     Pending   0          36m
    polyaxon-polyaxon-k8s-events-9f8cdf5-fvqnx     0/1     Pending   0          36m
    polyaxon-polyaxon-monitors-58766747c9-gcf2r    0/2     Pending   0          36m
    polyaxon-polyaxon-resources-rnntm              1/1     Running   0          36m
    polyaxon-polyaxon-resources-t4pv6              0/1     Pending   0          36m
    polyaxon-polyaxon-resources-x9f42              0/1     Pending   0          36m
    polyaxon-polyaxon-scheduler-76bfdcfcc7-d9tq4   0/1     Pending   0          36m
    polyaxon-postgresql-f4fc68c67-lwgds            1/1     Running   0          36m
    polyaxon-rabbitmq-74c5d87cf6-lhvj8             1/1     Running   0          36m
    polyaxon-redis-6f7db88668-6wlgs                1/1     Running   0          36m
    

    I'm not quite sure what's going on here. My best guess would be that the virtual machines don't have the necessary resources to run these pods? ... Would be interesting to hear the experts weigh in 😄.

    Please help!

    opened by jayleverett 20
  • polyaxon/polyaxon-api starts but no service comes up


    docker log

    Running...
    Use default user
    nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
    nginx: configuration file /etc/nginx/nginx.conf test is successful
    Restarting nginx: nginx.
    nginx is running.
    [uWSGI] getting INI configuration from web/uwsgi.nginx.ini
    *** Starting uWSGI 2.0.18 (64bit) on [Tue Aug 18 08:34:22 2020] ***
    compiled with version: 6.3.0 20170516 on 13 August 2020 13:15:05
    os: Linux-4.18.0-193.el8.x86_64 #1 SMP Fri May 8 10:59:10 UTC 2020
    nodename: polyaxon-polyaxon-api-5c8f885949-wjq9p
    machine: x86_64
    clock source: unix
    pcre jit disabled
    detected number of CPU cores: 4
    current working directory: /polyaxon
    detected binary path: /usr/local/bin/uwsgi
    uWSGI running as root, you can use --uid/--gid/--chroot options
    *** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
    chdir() to /polyaxon/web/..
    your memory page size is 4096 bytes
    detected max file descriptor number: 1048576
    lock engine: pthread robust mutexes
    thunder lock: enabled
    uwsgi socket 0 bound to UNIX address /polyaxon/web/../web/polyaxon.sock fd 3
    uWSGI running as root, you can use --uid/--gid/--chroot options
    *** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
    Python version: 3.7.6 (default, Jan  3 2020, 23:53:24)  [GCC 6.3.0 20170516]
    Python main interpreter initialized at 0x5626c4254800
    uWSGI running as root, you can use --uid/--gid/--chroot options
    *** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
    python threads support enabled
    your server socket listen backlog is limited to 100 connections
    your mercy for graceful operations on workers is 60 seconds
    mapped 425960 bytes (415 KB) for 4 cores
    *** Operational MODE: preforking ***
    added /polyaxon/web/../polyaxon/ to pythonpath.
    WSGI app 0 (mountpoint='') ready in 2 seconds on interpreter 0x5626c4254800 pid: 66 (default app)
    uWSGI running as root, you can use --uid/--gid/--chroot options
    *** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
    *** uWSGI is running in multiple interpreter mode ***
    spawned uWSGI master process (pid: 66)
    spawned uWSGI worker 1 (pid: 72, cores: 1)
    spawned uWSGI worker 2 (pid: 73, cores: 1)
    spawned uWSGI worker 3 (pid: 74, cores: 1)
    spawned uWSGI worker 4 (pid: 75, cores: 1)
    

    docker image

    polyaxon/polyaxon-gateway                                        1.1.7                 a52bd2a3a36d        4 days ago          473MB
    polyaxon/polyaxon-api                                            1.1.7                 dc1d59a6bff9        4 days ago          590MB
    polyaxon/polyaxon-cli                                            1.1.7                 5ea8e132a2a0        4 days ago          419MB
    

    kubectl --namespace=polyaxon get pod

    NAME                                          READY   STATUS    RESTARTS   AGE
    polyaxon-polyaxon-api-5c8f885949-wjq9p        0/1     Running   4          30m
    polyaxon-polyaxon-gateway-77c4d46d4d-t85ww    1/1     Running   0          30m
    polyaxon-polyaxon-operator-7f48b54676-mh48l   1/1     Running   0          30m
    polyaxon-polyaxon-streams-7c4876dc54-jh2p6    1/1     Running   0          30m
    polyaxon-postgresql-0                         1/1     Running   0          30m
    

    helm version

    Client: &version.Version{SemVer:"v2.16.10", GitCommit:"bceca24a91639f045f22ab0f41e47589a932cf5e", GitTreeState:"clean"}
    Server: &version.Version{SemVer:"v2.16.10", GitCommit:"bceca24a91639f045f22ab0f41e47589a932cf5e", GitTreeState:"clean"}
    

    kubectl version

    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:48:36Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
    
    question 
    opened by zhangchunsheng 19
  • Logs are not displayed correctly in terminal


    Describe the bug

    Unable to see the logs correctly. Unfortunately, the only things visible in the terminal are callback errors:

    $ polyaxon experiment -xp X logs
    building -- 
    scheduled -- 
    starting -- 
    running -- 
    error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
    error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
    error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
    error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
    error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
    ...
    error from callback <bound method SocketTransportMixin._on_close of <polyaxon_client.transport.Transport object at 0x7fd723190978>>: _on_close() missing 1 required positional argument: 'ws'
    

    To Reproduce

    Started the experiment with polyaxon run -u, and then started the logs view with polyaxon experiment -xp X logs.

    Experiment:

    https://github.com/polyaxon/polyaxon-examples/tree/master/tensorflow/cifare10/polyaxonfile.yml
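
    For reference, a minimal sketch of the error class seen above (not Polyaxon's actual callback code): on Python versions before 3.6, json.loads() rejects bytes, and websocket frames typically arrive as bytes, so the handler has to decode first:

    import json

    def on_message(message):
        # guard that avoids "the JSON object must be str, not 'bytes'"
        if isinstance(message, bytes):
            message = message.decode("utf-8")
        return json.loads(message)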

    Expected behavior

    Building -- creating image -
      master.1 -- INFO:tensorflow:Using config: {'_model_dir': '/outputs/root/cifar10/experiments/1', '_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_session_config': gpu_options {
      master.1 --   force_gpu_compatible: true
      master.1 -- }
    

    Environment

    Local

    polyaxon is running within a virtualenv using python3.

    Cluster

    OS: Ubuntu 18.04, Kubernetes: 1.12.1

    bug 
    opened by naetherm 19
  • "cluster-admin not found" error while installing polyaxon with helm

    I am using minikube to set up a local single-node Kubernetes cluster. I have set up helm as described in the docs. But when I try to deploy Polyaxon by following the docs, I get an error.

    temp-training:~ shivam.m$ helm install --wait polyaxon/polyaxon
    Error: release rousing-peahen failed: clusterroles.rbac.authorization.k8s.io "rousing-peahen-polyaxon-ingress-clusterrole" is forbidden: attempt to grant extra privileges: [PolicyRule{Resources:["configmaps"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["configmaps"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["endpoints"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["endpoints"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["pods"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["pods"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["secrets"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["secrets"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["ingresses"], APIGroups:["extensions"], Verbs:["get"]} PolicyRule{Resources:["ingresses"], APIGroups:["extensions"], Verbs:["list"]} PolicyRule{Resources:["ingresses"], APIGroups:["extensions"], Verbs:["watch"]} PolicyRule{Resources:["events"], APIGroups:[""], Verbs:["create"]} PolicyRule{Resources:["events"], APIGroups:[""], Verbs:["patch"]} PolicyRule{Resources:["ingresses/status"], APIGroups:["extensions"], Verbs:["update"]}] user=&{system:serviceaccount:kube-system:tiller 8e197f15-1373-11e8-9b02-080027bbca2c [system:serviceaccounts system:serviceaccounts:kube-system system:authenticated] map[]} ownerrules=[] ruleResolutionErrors=[clusterroles.rbac.authorization.k8s.io "cluster-admin" not found]

    I tried disabling RBAC and running it again, but then I get an error related to port allocation:

    temp-training:~ shivam.m$ helm install --set=rbac.enabled=false polyaxon/polyaxon
    Error: release mortal-gorilla failed: Service "mortal-gorilla-docker-registry" is invalid: spec.ports[0].nodePort: Invalid value: 31813: provided port is already allocated

    bug 
    opened by codophobia 19
  • Unable to run experiments with v1.1.8


    Describe the bug

    Unable to run experiments with the new version 1.1.8:

    Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f168f918700>: Failed to establish a new connection: [Errno 111] Connection refused')'

    This seems to come from tracking.init().
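
    For reference, a minimal sketch of where this error surfaces in a training script, assuming the v1 tracking client; the metric call is illustrative:

    from polyaxon import tracking

    # Connects to the Polyaxon API using in-cluster env vars; the retry /
    # connection-refused errors above are raised here when it is unreachable.
    tracking.init()

    # ... training loop ...
    tracking.log_metrics(step=1, loss=0.1)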

    Also when running polyaxon project ls (only the first time):

    Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f030fb6dbe0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /api/v1/compatibility/cb08b595c6be5fe48fcbaf4860dd900c/1-1-8/cli
    Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f030fb6dc88>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /api/v1/compatibility/cb08b595c6be5fe48fcbaf4860dd900c/1-1-8/cli
    Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f030fb6dd68>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /api/v1/compatibility/cb08b595c6be5fe48fcbaf4860dd900c/1-1-8/cli
    Could not connect to remote server to fetch compatibility versions.
    Checking CLI compatibility version ...
    Could get the min/latest versions from compatibility API.
    

    However, if I run it again, it works as expected.

    To Reproduce

    version: 1.1
    kind: component
    name: simple-experiment
    description: Minimum information to run this TF.Keras example
    tags: [examples]
    run:
      kind: job
      init:
      - git: {url: "https://github.com/polyaxon/polyaxon-quick-start"}
        container:
          env:
            - name: http_proxy
              value: "***"
            - name: https_proxy
              value: "***"
      container:
        image: polyaxon/polyaxon-quick-start
        workingDir: "{{ globals.artifacts_path }}/polyaxon-quick-start"
        command: [python3, model.py]
        env:
          - name: http_proxy
            value: "***"
          - name: https_proxy
            value: "***"
    

    Expected behavior

    A running experiment.

    Environment

    deploymentChart: platform
    deploymentVersion: 1.1.8
    
    artifactsStore:
      name: minio
      kind: s3
      schema: {"bucket": "***"}
      secret:
        name: "***"
    
    connections:
      - name: data
        kind: volume_claim
        schema:
          mountPath: ***
          volumeClaim: ***
          readOnly: true
    
    scheduler:
      enabled: true
    
    streams:
      enabled: true
    
    postgresql:
      persistence:
        enabled: true
        storageClass: nfs
    
    redis:
      enabled: true
      master:
        persistence:
          enabled: true
          storageClass: nfs
      slave:
        persistence:
          enabled: true
          storageClass: nfs
    broker: redis
    
    rabbitmq-ha:
      enabled: false
    
    ui:
      enabled: true
      adminEnabled: true
    
    bug regression 
    opened by ONordander 17
  • How to build the polyaxon/polyaxon-operator image?


    Describe the problem

    How do I build the polyaxon/polyaxon-operator image? I didn't find the related Dockerfile. Is it built from https://github.com/polyaxon/mloperator?

    opened by hongqing1986 0
  • Polyaxon can't get plxlogs for pytorchjob in dashboard


    When I run the pytorchjob, I can't get the plxlogs in the dashboard after the job has finished.

    But if I click the logs button for this job in the dashboard before the job finishes, I can collect the plxlogs.

    If I run a common job, the plxlogs are normal.

    yaml config

    version: 1
    kind: component
    tags: [examples, pytorch, kubeflow]
    run:
      kind: pytorchjob
      master:
        replicas: 1
        init:
          - git: {"url": "https://github.com/polyaxon/polyaxon-examples"}
        container:
          image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
          command: ["sh", "-c", "python -u {{ globals.artifacts_path }}/polyaxon-examples/in_cluster/kubeflow/pytorchjob/mnist.py"]
          resources:
            requests:
              nvidia.com/gpu: 1
      worker:
        replicas: 1
        init:
          - git: {"url": "https://github.com/polyaxon/polyaxon-examples"}
        container:
          image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
          command: ["sh", "-c", "python -u {{ globals.artifacts_path }}/polyaxon-examples/in_cluster/kubeflow/pytorchjob/mnist.py"]
          resources:
            requests:
              nvidia.com/gpu: 1

    question 
    opened by hongqing1986 15
  • Preserve an artifact staged status when copied / transferred across projects


    Use case

    There's the immediate feature the title describes, and there is a more general use case of including the model / artifact --name in plx models stage: when there are multiple models with many versions, two models with different names can have the same version (there shouldn't be a restriction on version names between different models), and one wants to change the status of a single one via

    plx models stage -p PROJECT -n MODEL -ver rc0 -to=production
    

    where by -n I mean the artifact name (i.e. the name used in plx model register --artifact)

    Alternatives

    The alternative is to run plx models stage again, but that is not possible in the global model registry where there are multiple models, since plx models stage doesn't have a --name corresponding to the model / artifact name.

    question area/cli area/registry 
    opened by ehsanmok 1
  • Polyaxon Operator HA


    Use case

    After a network outage on the node where polyaxon-operator was running, polyaxon-operator did not come back up correctly. It was stuck in leader election, which made it stop working entirely:

    E1206 17:17:24.588551 1 leaderelection.go:357] Failed to update lock: Put "https://10.40.0.1:443/apis/coordination.k8s.io/v1/namespaces/polyaxon-websensa/leases/ops.core.polyaxon.com": context deadline exceeded
    I1206 17:18:42.637003 1 leaderelection.go:278] failed to renew lease polyaxon-websensa/ops.core.polyaxon.com: timed out waiting for the condition

    Feature description

    Polyaxon-operator should be highly available.

    Polyaxon CE 1.12.3, Kubernetes 1.19.8

    question area/operator 
    opened by boniek83 1