Dorylus: Affordable, Scalable, and Accurate GNN Training

Overview

Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads

This is Dorylus, a scalable, resource-efficient, and affordable computation system for Graph Neural Networks. It is built on an architecture that combines cheap data servers on AWS EC2 with serverless computing on AWS Lambda threads.

The data server was originally a push-based ASPIRE implementation, a cleaned-up version of gift (forked on July 06, 2016), and implemented streaming-like processing as in the Tornado (SIGMOD'16) paper.

The main logic of the engine has since been greatly simplified and integrated with AWS Lambda threads. The ultimate goal is "Affordable AI", taking advantage of the cheap scalability that serverless computing brings.

Check out our OSDI'21 paper for details of the design.

User Guide

Check our Wiki page for instructions on managing your EC2 clusters and on building & running Dorylus.
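
The Wiki is the authoritative reference; as a rough end-to-end sketch, a run involves preparing the dataset and then starting the weight and graph servers. The commands below are assembled from the issue reports further down (the `--l 5 --e 5` flags are taken verbatim from one of them), with angle-bracketed placeholders rather than a verified recipe:

    # Partition the raw graph and convert features/labels into Dorylus' binary format
    # (signature as quoted in the issues below; all angle-bracketed values are placeholders).
    ./prepare <PathToRawGraph> <Undirected? (0/1)> <NumVertices> <NumPartitions> <PathToRawFeatures> <DimFeatures> <PathToRawLabels> <LabelKinds>

    # Start the weight server(s), then the graph servers, for a prepared dataset.
    ./run/run-onnode weight <dataset>
    ./run/run-onnode graph <dataset> --l 5 --e 5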

Issues
  • Execution is stuck at Epoch 1


    Dear maintainers, I'm running Dorylus on Lambda right now. My graph server master gets stuck at Epoch 1 and then returns: "[ Node 0 ] [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.864Z 7caf9b09-14b0-4cc9-9ca9-edd041fea72a Task timed out after 600.02 seconds". My Lambda prints no logs, so I have no idea what's going on. Could you help me solve the problem? Thank you very much!

    More details are presented below.

    My Cluster

    • graphserver: 2 × EC2 c5n.2xlarge, us-east-1f

    • weightserver: 1 × EC2 t2.medium, us-east-1c

    • lambda: named "gcn", memory 192 MB, timeout 10 min.

      It is configured to be in the same VPC as the graph servers and the weight server (but the GSs and the WS are not in the same subnet, as shown above). A sketch of how this configuration can be inspected from the AWS CLI follows this list.
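
    A minimal sketch of how this configuration can be inspected from the AWS CLI (the function name "gcn" is taken from above; the log-group name assumes the default /aws/lambda/<function> convention):

      # Show which VPC, subnets, and security groups the function is attached to
      aws lambda get-function-configuration --function-name gcn --query VpcConfig

      # Confirm the timeout (seconds) and memory (MB) settings
      aws lambda get-function-configuration --function-name gcn --query '[Timeout,MemorySize]'

      # Check whether the function has produced any CloudWatch log streams at all
      aws logs describe-log-streams --log-group-name /aws/lambda/gcn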

    My Input Dataset and Configs

    I don't have much machine-learning knowledge, so I use the simple example graph from the tutorial as the input dataset.

    • small.graph

      0 1
      0 2
      1 3
      2 4
      3 5
      
    • feature

      The feature dimension passed to ./prepare is 4:

      0.3, 0.2, 0.5, 0.7
      0.7, 0.5, 0.5, 0.3
      0.1, 0.2, 0.8, 0.7
      0.3, 0.4, 0.5, 0.1
      0.3, 0.4, 0.2, 0.1
      0.3, 0.6, 0.5, 0.8
      
    • labels

      I set the corresponding argument of ./prepare to 1:

      0
      3
      2
      1
      2
      3
      
    • layerconfig (renamed to simplegraph.config later)

      4
      8
      10
      
    • Other configs: all other configs (e.g., gserverport, nodeport) use their default values. (A sketch of how the files above map onto ./prepare follows below.)
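
    For reference, a hedged sketch of how these files would map onto the ./prepare signature quoted in a later issue: 6 vertices, 2 partitions (matching parts_2 in the logs below), feature dimension 4, directed edges (0), and 4 label kinds inferred from the label values 0–3 above. These values are illustrative assumptions, not the exact arguments used in this report:

      # Assumed mapping onto: ./prepare <graph> <undirected 0/1> <numVertices> <numParts> <features> <dimFeatures> <labels> <labelKinds>
      ./prepare small.graph 0 6 2 feature 4 labels 4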

    Commands and Full Logs

    • Commands:

      • graphserver: ./run/run-onnode graph simplegraph --l 5 --e 5
      • weightserver: ./run/run-onnode weight simplegraph
    • Full Logs

      • graphserver

        [ Node 404 ]  Engine starts initialization...
        [ Node 404 ]  Parsed configuration: dThreads = 2, cThreads = 8, datasetDir = /filepool/simplegraph/parts_2/, featuresFile = /filepool/simplegraph/features.bsnap, dshMachinesFile = /home/ubuntu/dshmachines, myPrIpFile = /home/ubuntu/myprip, undirected = false, data port set -> 5000, control port set -> 7000, node port set -> 6000
        [ Node 404 ]  NodeManager starts initialization...
        [ Node   1 ]  Private IP: 172.31.73.108
        [ Node 404 ]  Engine starts initialization...
        [ Node 404 ]  Parsed configuration: dThreads = 2, cThreads = 8, datasetDir = /filepool/simplegraph/parts_2/, featuresFile = /filepool/simplegraph/features.bsnap, dshMachinesFile = /home/ubuntu/dshmachines, myPrIpFile = /home/ubuntu/myprip, undirected = false, data port set -> 5000, control port set -> 7000, node port set -> 6000
        [ Node 404 ]  NodeManager starts initialization...
        [ Node   0 ]  Private IP: 172.31.79.216
        [ Node   0 ]  NodeManager initialization complete.
        [ Node   0 ]  CommManager starts initialization...
        [ Node   1 ]  NodeManager initialization complete.
        [ Node   0 ]  CommManager starts initialization...
        [ Node   0 ]  CommManager initialization complete.
        [ Node   1 ]  CommManager initialization complete.
        [ Node   0 ]  Preprocessing... Output to /filepool/simplegraph/parts_2/graph.0.bin
        [ Node   1 ]  Preprocessing... Output to /filepool/simplegraph/parts_2/graph.1.bin
        Cannot open output file:/filepool/simplegraph/parts_2/graph.0.bin, [Reason: Permission denied]
        Cannot open input file: /filepool/simplegraph/parts_2/graph.0.bin, [Reason: No such file or directory]
        [ Node   0 ]  Finish preprocessing!
        [ Node   0 ]  <GM>: 0 global vertices, 0 global edges,
                        0 local vertices, 0 local in edges, 0 local out edges
                        0 out ghost vertices, 0 in ghost vertices
        [ Node   0 ]  No feature cache, loading raw data...
        [ Node   0 ]  Cannot open output cache file: /filepool/simplegraph/parts_2/feats4.0.bin [Reason: Permission denied]
        Cannot open output file:/filepool/simplegraph/parts_2/graph.1.bin, [Reason: Permission denied]
        Cannot open input file: /filepool/simplegraph/parts_2/graph.1.bin, [Reason: No such file or directory]
        [ Node   1 ]  Finish preprocessing!
        [ Node   1 ]  <GM>: 0 global vertices, 0 global edges,
                        0 local vertices, 0 local in edges, 0 local out edges
                        0 out ghost vertices, 0 in ghost vertices
        [ Node   1 ]  No feature cache, loading raw data...
        [ Node   1 ]  Cannot open output cache file: /filepool/simplegraph/parts_2/feats4.1.bin [Reason: Permission denied]
        [ Node   1 ]  Engine initialization complete.
        [ Node   0 ]  Engine initialization complete.
        [ Node   0 ]  Number of epochs: 5, validation frequency: 1
        [ Node   0 ]  Sync Epoch 1 starts...
        [ Node   1 ]  Sync Epoch 1 starts...
        [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.864Z 7caf9b09-14b0-4cc9-9ca9-edd041fea72a Task timed out after 600.02 seconds
        [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.938Z 548bbee5-c454-479f-956c-baeb0e8e239b Task timed out after 600.02 seconds
        [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.993Z 5d6f2b65-2386-4169-9868-e43f0c5a0361 Task timed out after 600.10 seconds
        [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.001Z 402b364b-4579-40f4-94a5-03f5efde62c0 Task timed out after 600.10 seconds
        [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.004Z 285502c8-06e2-4585-8a81-983110962a8e Task timed out after 600.10 seconds
        [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.020Z a93af084-45b9-4a86-b0a9-6b3bcc37cb06 Task timed out after 600.10 seconds
        [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.032Z fe90565a-f474-4c29-84c9-1cae963f3d63 Task timed out after 600.09 seconds
        [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.053Z e40b2790-0c6a-4c28-9bec-dc2075b0a5ce Task timed out after 600.02 seconds
        [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.018Z f85d186c-56f3-4cc8-87bf-36aa28034f23 Task timed out after 600.10 seconds
        [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.075Z 627871c8-fa47-4812-af2e-11b9cc7da5fd Task timed out after 600.02 seconds
        
      • weightserver

        Killing existing 'weightserver' processes... 
        weightserver: no process found
        Running WEIGHT servers with: [ MARK # 12 ]...
        ./build/weightserver /home/ubuntu/dshmachines /home/ubuntu/myprip /home/ubuntu/gserverip 5000 65433 55431 /home/ubuntu/simplegraph.config /home/ubuntu/tmpfiles 1 1 0 GCN 0.01 0.02
        Binding weight server to tcp://*:65433...
        [ WS   0 ] Initializing nodes...
        [ WS   0 ] All weight servers connected.
        [ WS   0 ] Layer 0 - Weights: (4, 8)
        [ WS   0 ] Layer 1 - Weights: (8, 10)
        [ WS   0 ] All nodes up to date.
        [  INFO  ] Number of lambdas set to 10.
        
      • Lambda

        No logs appear in CloudWatch.

    opened by CodingYuanLiu 8
  • Question: how to run dorylus on other platforms


    Dear maintainers,

    I'm using a FaaS platform other than AWS Lambda and would like to run Dorylus on other serverless platforms such as OpenWhisk. However, I find that the implementation of Dorylus is tightly coupled with AWS Lambda.

    Could you please give me some advice on how to run Dorylus on other platforms? For example, which files should I modify, and approximately how much time or effort would it take?

    Thank you!

    opened by CodingYuanLiu 2
  • What specific kind of Lambda trigger did you use as its event and why C++?


    Hi! I really appreciate your great work, and I'm curious about this project's implementation.

    I'm wondering what specific kind of Lambda trigger you used to start it. (AWS Lambda is an event-based system; the trigger could be an HTTP request and so on.) Thanks.

    Also, why did you use C++ for the Lambda computation? In the link below, C++ is not listed as a supported language for Lambda. Could you please explain the idea behind using C++ and how you run C++ code on Lambda threads? Thanks a lot. https://aws.amazon.com/lambda/faqs/#:~:text=Q%3A%20What%20languages%20does%20AWS%20Lambda%20support%3F
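
    For context on C++ and Lambda: AWS Lambda can run C++ through a custom runtime. The handler is compiled (for example against the aws-lambda-cpp runtime library) into a self-contained bootstrap binary, zipped, registered under the provided.al2 runtime, and then invoked directly through the SDK/CLI rather than via an HTTP trigger. A minimal sketch with placeholder names, not necessarily the exact packaging Dorylus uses:

      # Register a zipped C++ bootstrap binary as a custom-runtime function
      aws lambda create-function \
          --function-name gcn \
          --runtime provided.al2 \
          --handler gcn \
          --role arn:aws:iam::<account-id>:role/<lambda-role> \
          --zip-file fileb://gcn.zip \
          --timeout 600 --memory-size 192

      # Invoke it directly (no HTTP trigger required)
      aws lambda invoke --function-name gcn out.json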

    Best.

    opened by Lorisyy 2
  • Question about the loading of Friendster


    @ivanium @kevalvora @josehu07 Hi authors, good work! I'm using a c5n.4xlarge instance. However, when I run the command to prepare the dataset:

    ./prepare <PathToRawGraph> <Undirected? (0/1)> <NumVertices> <NumPartitions> <PathToRawFeatures> <DimFeatures> <PathToRawLabels> <LabelKinds>
    

    I get an out-of-memory error (42 GB is not enough). To run Friendster, do I need to use a large-memory node to do the partitioning first and then distribute those partitions to each node? How large should the RAM be?
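
    In other words, something along these lines (hostnames, paths, and the parts_<N> directory name are placeholders following the pattern in the logs of the first issue):

      # Run ./prepare once on a high-memory node, then copy the generated
      # partition directory to every graph server.
      for host in gs0 gs1 gs2 gs3; do
          rsync -a /filepool/friendster/parts_4/ ubuntu@$host:/filepool/friendster/parts_4/
      done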

    Thanks for the help!

    opened by TracyRixner 1
  • Can we run dorylus without ec2man?


    Hey authors, @kevalvora @ivanium @josehu07 @redhairdragon @johnnt849 Thank you for open-sourcing Dorylus and writing this detailed wiki!

    I am wondering whether it is possible to run Dorylus without using ec2man. Basically, I would launch some instances on AWS, compile weightserver and graphserver into binaries, and run them directly. Is that possible?

    Also, for build--upload-lambda-functions, I cannot find the example forward-prop-josehu; can you give me a pointer for that?

    Do I need to strictly follow the instance types required here for the CPU/GPU backends? Can I launch other instance types?

        For the serverless based backend: ami-07aec0eb32327b38d
        For the CPU based backend: ami-04934e5a63d144d88
        For the GPU based backend: ami-01c390943eecea45c
        For the Weight Server: ami-0901fc9a7bc310a8a
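
    For example, would launching directly from these AMIs with the CLI work, along these lines (instance type, key, and networking values are placeholders)?

      # Launch two graph-server instances straight from the CPU-backend AMI
      aws ec2 run-instances \
          --image-id ami-04934e5a63d144d88 \
          --instance-type c5n.2xlarge \
          --count 2 \
          --key-name <my-key> \
          --security-group-ids <sg-id> \
          --subnet-id <subnet-id>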
    

    Thanks!

    opened by yuxineverforever 1
Owner

UCLASystem (UCLA SOLAR Lab)