Large-scale embeddings on a single machine.

Marius

Marius is a system under active development for training embeddings for large-scale graphs on a single machine.

Training on large-scale graphs requires a large amount of data movement to get embedding parameters from storage to the computational device. Marius is designed to reduce these data movement overheads using:

  • Pipelined training and IO (sketched below)
  • Partition caching and locality-aware data orderings
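
A conceptual sketch of the first technique, using a bounded producer/consumer queue (illustrative Python only, not Marius' actual pipeline; the loader and train step are stand-ins):

    import queue, threading, time

    def load_from_storage(batch_id):       # stand-in for reading embedding parameters from disk
        time.sleep(0.01)                   # simulated IO latency
        return [0.0] * 8

    def train_step(batch_id, embeddings):  # stand-in for the device computation
        time.sleep(0.01)

    prefetch = queue.Queue(maxsize=4)      # bounded queue keeps IO a few batches ahead

    def io_worker(batch_ids):
        for b in batch_ids:
            prefetch.put((b, load_from_storage(b)))
        prefetch.put(None)                 # sentinel: no more batches

    threading.Thread(target=io_worker, args=(range(100),), daemon=True).start()
    while True:
        item = prefetch.get()
        if item is None:
            break
        train_step(*item)                  # compute overlaps with the next read

Because the queue is bounded, the IO thread reads ahead while the device computes, hiding storage latency without unbounded memory growth.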

Details on how Marius works can be found in the documentation, or in our OSDI '21 paper.

Requirements

(Other versions may work, but are untested)

  • Ubuntu 18.04 or MacOS 10.15
  • CUDA 10.1 or 10.2 (If using GPU training)
  • CuDNN 7 (If using GPU training)
  • pytorch >= 1.7
  • python >= 3.6
  • pip >= 21
  • GCC >= 9 (On Linux) or Clang 12.0 (On MacOS)
  • cmake >= 3.12
  • make >= 3.8

Installation from source with Pip

  1. Install the latest version of PyTorch for your CUDA version:

    Linux:

    • CUDA 10.1: python3 -m pip install torch==1.7.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
    • CUDA 10.2: python3 -m pip install torch==1.7.1
    • CPU Only: python3 -m pip install torch==1.7.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

    MacOS:

    • CPU Only: python3 -m pip install torch==1.7.1
  2. Clone the repository: git clone https://github.com/marius-team/marius.git

  3. Build and install Marius: cd marius; python3 -m pip install .

Full script (without torch install)

git clone https://github.com/marius-team/marius.git
cd marius
python3 -m pip install .

Installation from source with CMake

  1. Clone the repository: git clone https://github.com/marius-team/marius.git

  2. Install dependencies: cd marius; python3 -m pip install -r requirements.txt

  3. Create the build directory: mkdir build; cd build

  4. Run cmake in the build directory: cmake ../ (CPU-only build) or cmake ../ -DUSE_CUDA=1 (GPU build)

  5. Build the marius_train executable: make marius_train -j

Full script (without torch install)

git clone https://github.com/marius-team/marius.git
cd marius
python3 -m pip install -r requirements.txt
mkdir build
cd build
cmake ../ -DUSE_CUDA=1
make -j

Training a graph

Training embeddings on a graph requires three steps.

  1. Define a configuration file. This example will use the config already defined in examples/training/configs/fb15k_gpu.ini

    See docs/configuration.rst for full details on the configuration options; an illustrative excerpt follows these steps.

  2. Preprocess the dataset: marius_preprocess fb15k output_dir/

    The first argument of marius/tools/preprocess.py defines the dataset we wish to download and preprocess, in this case fb15k. The second argument tells the preprocessor where to put the preprocessed dataset.

  3. Run the training executable with the config file: marius_train examples/training/configs/fb15k_gpu.ini
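
For reference, configuration files are plain .ini files organized into sections and options (the same <section>.<option> names used for command-line overrides). The excerpt below is illustrative only; the option names are inferred from examples elsewhere in this document, so consult docs/configuration.rst for the real schema:

    [model]
    embedding_size=100
    decoder_model=ComplEx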

The output of the first epoch should be similar to the following.

[info] [03/18/21 01:33:18.778] Metadata initialized
[info] [03/18/21 01:33:18.778] Training set initialized
[info] [03/18/21 01:33:18.779] Evaluation set initialized
[info] [03/18/21 01:33:18.779] Preprocessing Complete: 2.605s
[info] [03/18/21 01:33:18.791] ################ Starting training epoch 1 ################
[info] [03/18/21 01:33:18.836] Total Edges Processed: 40000, Percent Complete: 0.082
[info] [03/18/21 01:33:18.862] Total Edges Processed: 80000, Percent Complete: 0.163
[info] [03/18/21 01:33:18.892] Total Edges Processed: 120000, Percent Complete: 0.245
[info] [03/18/21 01:33:18.918] Total Edges Processed: 160000, Percent Complete: 0.327
[info] [03/18/21 01:33:18.944] Total Edges Processed: 200000, Percent Complete: 0.408
[info] [03/18/21 01:33:18.970] Total Edges Processed: 240000, Percent Complete: 0.490
[info] [03/18/21 01:33:18.996] Total Edges Processed: 280000, Percent Complete: 0.571
[info] [03/18/21 01:33:19.021] Total Edges Processed: 320000, Percent Complete: 0.653
[info] [03/18/21 01:33:19.046] Total Edges Processed: 360000, Percent Complete: 0.735
[info] [03/18/21 01:33:19.071] Total Edges Processed: 400000, Percent Complete: 0.816
[info] [03/18/21 01:33:19.096] Total Edges Processed: 440000, Percent Complete: 0.898
[info] [03/18/21 01:33:19.122] Total Edges Processed: 480000, Percent Complete: 0.980
[info] [03/18/21 01:33:19.130] ################ Finished training epoch 1 ################
[info] [03/18/21 01:33:19.130] Epoch Runtime (Before shuffle/sync): 339ms
[info] [03/18/21 01:33:19.130] Edges per Second (Before shuffle/sync): 1425197.8
[info] [03/18/21 01:33:19.130] Edges Shuffled
[info] [03/18/21 01:33:19.130] Epoch Runtime (Including shuffle/sync): 339ms
[info] [03/18/21 01:33:19.130] Edges per Second (Including shuffle/sync): 1425197.8
[info] [03/18/21 01:33:19.148] Starting evaluating
[info] [03/18/21 01:33:19.254] Pipeline flush complete
[info] [03/18/21 01:33:19.271] Num Eval Edges: 50000
[info] [03/18/21 01:33:19.271] Num Eval Batches: 50
[info] [03/18/21 01:33:19.271] Auc: 0.973, Avg Ranks: 24.477, MRR: 0.491, Hits@1: 0.357, Hits@5: 0.651, Hits@10: 0.733, Hits@20: 0.806, Hits@50: 0.895, Hits@100: 0.943

To train using CPUs only, use the examples/training/configs/fb15k_cpu.ini configuration file instead.

Using the Python API

Sample Code

Below is a sample Python script which trains a single epoch of embeddings on fb15k.

import marius as m
from marius.tools import preprocess

def fb15k_example():

    preprocess.fb15k(output_dir="output_dir/")
    
    config_path = "examples/training/configs/fb15k_cpu.ini"
    config = m.parseConfig(config_path)

    train_set, eval_set = m.initializeDatasets(config)

    model = m.initializeModel(config.model.encoder_model, config.model.decoder_model)

    trainer = m.SynchronousTrainer(train_set, model)
    evaluator = m.SynchronousEvaluator(eval_set, model)

    trainer.train(1)
    evaluator.evaluate(True)


if __name__ == "__main__":
    fb15k_example()

Marius in Docker

Marius can be deployed within a Docker container. Below is a sample Ubuntu Dockerfile (located at examples/docker/dockerfile) with the necessary dependencies for GPU training preinstalled.

Building and running the container

Build an image with the name marius and the tag example:
docker build -t marius:example -f examples/docker/dockerfile examples/docker

Create and start a new container instance named gaius with:
docker run --name gaius -itd marius:example

Run docker ps to verify the container is running.

Start a bash session inside the container:
docker exec -it gaius bash

Sample Dockerfile

See examples/docker/dockerfile

FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
RUN apt update

RUN apt install -y g++ \
         make \
         wget \
         unzip \
         vim \
         git \
         python3-pip

# install gcc-9
RUN apt install -y software-properties-common
RUN add-apt-repository -y ppa:ubuntu-toolchain-r/test
RUN apt update
RUN apt install -y gcc-9 g++-9
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 9
RUN update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 9

# install cmake 3.20
RUN wget https://github.com/Kitware/CMake/releases/download/v3.20.0/cmake-3.20.0-linux-x86_64.sh
RUN mkdir /opt/cmake
RUN sh cmake-3.20.0-linux-x86_64.sh --skip-license --prefix=/opt/cmake/
RUN ln -s /opt/cmake/bin/cmake /usr/local/bin/cmake

# install pytorch
RUN python3 -m pip install torch==1.7.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Citing Marius

@inproceedings {MariusOSDI2021,
    author = {Jason Mohoney and Roger Waleffe and Yiheng Xu and Theodoros Rekatsinas and Shivaram Venkataraman},
    title = {Marius: Learning Massive Graph Embeddings on a Single Machine},
    booktitle = {15th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 21)},
    year = {2021},
    publisher = {{USENIX} Association},
}
Issues
  • README example not working

    Describe the bug

    Traceback (most recent call last):
      File "/Users/cthoyt/dev/marius/test.py", line 20, in <module>
        fb15k_example()
      File "/Users/cthoyt/dev/marius/test.py", line 8, in fb15k_example
        train_set, eval_set = m.initializeDatasets(config)
    RuntimeError: filesystem error: in copy_file: No such file or directory [training_data/marius/edges/train/edges.bin] [output_dir/train_edges.pt]
    

    To Reproduce

    I took the example from the README verbatim besides fixing the config path

    import marius as m
    
    def fb15k_example():
        config_path = "/Users/cthoyt/dev/marius/examples/training/configs/kinships_cpu.ini"
        config = m.parseConfig(config_path)
    
        train_set, eval_set = m.initializeDatasets(config)
    
        model = m.initializeModel(config.model.encoder_model, config.model.decoder_model)
    
        trainer = m.SynchronousTrainer(train_set, model)
        evaluator = m.SynchronousEvaluator(eval_set, model)
    
        trainer.train(1)
        evaluator.evaluate(True)
    
    
    if __name__ == "__main__":
        fb15k_example()
    
    

    Expected behavior A clear and concise description of what you expected to happen.

    Environment macOS 11.2.3 (Big Sur), Python 3.9.2, pip installed from the latest code on marius

    bug 
    opened by cthoyt 5
  • Improve configuration file generation

    Is your feature request related to a problem? Please describe. Currently, configuration generation is performed for every call to preprocess.py and three configuration files are generated: one for CPU training, one for GPU training, and one for multi-GPU training.

    Describe the solution you'd like We should put this generation into a separate optional step where it can be included in the preprocessing by adding a flag to the preprocessor call.

    E.g.

    python3 preprocess.py fb15k output_dir/                             // No config generated
    python3 preprocess.py fb15k output_dir/ --generate_config           // generates a single-GPU training configuration file by default
    python3 preprocess.py fb15k output_dir/ --generate_config=GPU       // generates a single-GPU training configuration file
    python3 preprocess.py fb15k output_dir/ --generate_config=CPU       // generates a CPU training configuration file
    python3 preprocess.py fb15k output_dir/ --generate_config=Multi-GPU // generates a multi-GPU training configuration file
    

    Adding config arguments should be supported too.

    python3 preprocess.py fb15k output_dir/ --generate_config=GPU  <args>
    python3 preprocess.py fb15k output_dir/ --generate_config=GPU  --model.embedding_size=400     // generates a single-GPU training configuration file for 400 dimensional embeddings
    

    We should also allow for this configuration generation to be called separately. E.g.

    python3 generate_config.py <files> <args> // args should include --embedding_dimension --num_partitions and --config_type (GPU, CPU, multi-GPU)
    

    Describe alternatives you've considered We could also disable this feature, but providing configuration file generation makes it easier for users to get up and running on built-in and custom datasets.

    Additional context Eventually we will want to support generating a config based on the user's system characteristics. E.g. the user's system has 64 GB of memory and they want to train 400-dimensional embeddings on Freebase86m; we can set the number of partitions and buffer capacity to make good use of the available memory.
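
    A back-of-the-envelope sketch of the sizing logic this could use (all numbers and names are illustrative, not a real Marius API):

        # Hypothetical sizing for 400-dim float32 node embeddings on ~86M nodes (Freebase86m).
        num_nodes = 86_000_000
        dim = 400
        total_bytes = num_nodes * dim * 4                 # ~137.6 GB of embedding parameters
        mem_budget = 64 * 1024**3 // 2                    # reserve roughly half of 64 GB for the buffer
        num_partitions = 16
        partition_bytes = total_bytes // num_partitions   # ~8.6 GB per partition
        buffer_capacity = mem_budget // partition_bytes   # partitions resident in memory at once
        print(num_partitions, buffer_capacity)            # -> 16 3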

    enhancement 
    opened by JasonMoho 3
  • Add os import

    #22 was missing an OS import. Fixed.

    opened by JasonMoho 2
  • Boost download fails

    Describe the bug The boost download link is failing. See the excerpt below.

    This issue pops up from time to time with boost: https://github.com/boostorg/boost/issues/299 https://github.com/Orphis/boost-cmake/issues/88

 [ 22%] Performing download step (download, verify and extract) for 'boost-populate'
    -- Downloading...
       dst='/tmp/pip-k4qygwbv-build/build/temp.linux-x86_64-3.6/_deps/boost-subbuild/boost-populate-prefix/src/boost_1_71_0.tar.bz2'
       timeout='none'
       inactivity timeout='none'
    -- Using src='https://dl.bintray.com/boostorg/release/1.71.0/source/boost_1_71_0.tar.bz2'
    -- [download 0% complete]
    CMake Error at boost-subbuild/boost-populate-prefix/src/boost-populate-stamp/download-boost-populate.cmake:170 (message):
      Each download failed!

        error: downloading 'https://dl.bintray.com/boostorg/release/1.71.0/source/boost_1_71_0.tar.bz2' failed
              status_code: 22
              status_string: "HTTP response code said error"
              log:
              --- LOG BEGIN ---
                Trying 34.214.135.19:443...
                Connected to dl.bintray.com (34.214.135.19) port 443 (#0)
                [... TLS handshake details trimmed ...]
                SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
                Server certificate:
                 subject: CN=*.bintray.com
                 start date: Sep 26 00:00:00 2019 GMT
                 expire date: Nov  9 12:00:00 2021 GMT
                 subjectAltName: host "dl.bintray.com" matched cert's "*.bintray.com"
                 issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=GeoTrust RSA CA 2018
                 SSL certificate verify ok.
                GET /boostorg/release/1.71.0/source/boost_1_71_0.tar.bz2 HTTP/1.1
                Host: dl.bintray.com
                User-Agent: curl/7.75.0
                Accept: */*
                HTTP/1.1 403 Forbidden
                [... response headers trimmed ...]
                The requested URL returned error: 403
                Closing connection 0
              --- LOG END ---

    CMakeFiles/boost-populate.dir/build.make:98: recipe for target 'boost-populate-prefix/src/boost-populate-stamp/boost-populate-download' failed
    make[2]: *** [boost-populate-prefix/src/boost-populate-stamp/boost-populate-download] Error 1
    CMakeFiles/Makefile2:82: recipe for target 'CMakeFiles/boost-populate.dir/all' failed
    make[1]: *** [CMakeFiles/boost-populate.dir/all] Error 2
    Makefile:90: recipe for target 'all' failed
    make: *** [all] Error 2

    CMake Error at /opt/cmake/share/cmake-3.20/Modules/FetchContent.cmake:1012 (message):
      Build step for boost failed: 2
    Call Stack (most recent call first):
      /opt/cmake/share/cmake-3.20/Modules/FetchContent.cmake:1141:EVAL:2 (__FetchContent_directPopulate)
      /opt/cmake/share/cmake-3.20/Modules/FetchContent.cmake:1141 (cmake_language)
      third_party/boost-cmake/CMakeLists.txt:19 (FetchContent_Populate)

    To Reproduce Building Marius will encounter this issue if the boost servers are acting up.

    Expected behavior The download of boost should succeed.

    Environment Affects all environments

    Additional context We should remove the dependency on Boost. We only use it to parse .ini configuration files and for parsing command line options.

    blocking bug 
    opened by JasonMoho 2
  • Switch to `/src` layout for Python code

    Right now, the source code for both Python and C++ is scattered among several folders. It would be most idiomatic to have all Python code under /src/marius and perhaps /src/cpp for the other code. I'm not sure if this will be a problem with the extensions, though.

    enhancement 
    opened by cthoyt 1
  • Improve configuration file generation

    The second part of the solution for #9. Adds options for configuration generation and config arguments to config_generator.py, and adds tests for the above features.

    Usage:

    1. Users can specify a built-in dataset and have the configuration generated to the specified directory E.g. python3 ./tools/config_generator.py <output_directory> -d <built-in dataset> --<section>.<option>=<value>
    2. Users can specify statistics of their own dataset and have the configuration generated to the specified directory E.g. python3 ./tools/config_generator.py <output_directory> -s <num_nodes> <num_relations> <num_train> <num_valid> <num_test> --<section>.<option>=<value>
    opened by AnzeXie 1
  • fix incorrect generation of elimination orderings

    A small bug resulted in generating incorrect elimination orderings, which in turn caused a large number of partition swaps.

    Fixed and verified with the python buffer simulator that the orderings generated now result in the correct number of swaps.

    opened by JasonMoho 1
  • Update configurations

    Update configurations according to docs/configuration.rst

    opened by AnzeXie 1
  • Cleanup pip install and fix readme example

    • Adds a step in GitHub actions workflow to test the pip install
    • Adds fb15k end to end test based on the readme python example
    • Updated readme python example to include preprocessing step
    • Fix torch GIL issue when calling Trainer.train(). GIL needs to be released for any code which uses the autograd library
    • Rename marius directory to marius_package, to avoid issues with python scripts in the root directory that import marius
    • Test data cleanup
    • Cleanup unnecessary .gitignore entries

    Fixes #23

    opened by JasonMoho 0
  • Add basic out of memory shuffle for FlatFile storage backend

    Previously, to shuffle edges using the FlatFile backend (where the edge list is stored as a file on disk and read sequentially), all edges were loaded into memory and shuffled. This limits the size of the datasets which can be shuffled.

    This change adds an out-of-memory shuffle implementation which shuffles the dataset in chunks of 400 million edges (if unpartitioned). This chunk size results in about 8GB of memory overhead for shuffling when representing edges as int32s: (400 million * 3 * 4 bytes) + (400 million * 1 * 8 bytes) = 8GB, where the latter term is the overhead of storing the permutation indices. When using int64s the overhead becomes (400 million * 3 * 8 bytes) + (400 million * 1 * 8 bytes) = 12.8GB.
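
    A minimal sketch of the chunked approach, assuming an int32 edge list stored on disk as a (num_edges, 3) array (illustrative NumPy, not the actual implementation):

        import numpy as np

        def chunked_shuffle(path, num_edges, chunk_size=400_000_000):
            # Memory-map the on-disk edge list so only the active chunk is materialized.
            edges = np.memmap(path, dtype=np.int32, mode="r+", shape=(num_edges, 3))
            for start in range(0, num_edges, chunk_size):
                end = min(start + chunk_size, num_edges)
                perm = np.random.permutation(end - start)  # int64 indices: the permutation overhead noted above
                edges[start:end] = edges[start:end][perm]  # load chunk, permute, write back
            edges.flush()

    Note that edges only move within their own chunk; the sliding-window improvement mentioned below would let edges mix across chunk boundaries.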

    If the dataset is partitioned then the dataset will be shuffled in chunks that correspond to the edge buckets.

    This change also includes a small modification to how test data is managed, by placing it in build/test/test_data rather than build/test/cpp/test_data.

    The current implementation can be improved in the future to get a better shuffle by shuffling a sliding window over the edge list, but the current implementation should be good enough.

    opened by JasonMoho 0
  • Boost dependency removed

    Removed the Boost dependency and switched to cxxopts to resolve build and download issues; restructured the config parser and added additional config unit tests.

    opened by awcarlsson 0
  • GIL issue thrown when testing pip install on macOS workflow

    Describe the bug MacOS pip install test throwing GIL error even though all tests pass: https://github.com/marius-team/marius/runs/2401968116

    Could be an issue with Python 3.9, since the Linux workflow passes but uses Python 3.8. Possibly related to https://github.com/pytorch/pytorch/issues/49370

    Output:

    2021-04-21T16:01:51.7260960Z ##[group]Run python3 -c "import marius as m"
    2021-04-21T16:01:51.7261640Z python3 -c "import marius as m"
    2021-04-21T16:01:51.7262230Z python3 -c "from marius.tools import preprocess"
    2021-04-21T16:01:51.7262850Z marius_preprocess fb15k output_dir/
    2021-04-21T16:01:51.7263760Z pytest test
    2021-04-21T16:01:51.8917040Z shell: /bin/bash --noprofile --norc -e -o pipefail {0}
    2021-04-21T16:01:51.8917570Z env:
    2021-04-21T16:01:51.8918010Z   BUILD_TYPE: Release
    2021-04-21T16:01:51.8918430Z ##[endgroup]
    2021-04-21T16:02:03.4541320Z fb15k
    2021-04-21T16:02:03.4642510Z Downloading fb15k.tgz to output_dir/fb15k.tgz
    2021-04-21T16:02:03.4658930Z Extracting
    2021-04-21T16:02:03.4659870Z Extraction completed
    2021-04-21T16:02:03.4660660Z Detected delimiter: 	
    2021-04-21T16:02:03.4662650Z Reading in output_dir/freebase_mtr100_mte100-train.txt   1/3
    2021-04-21T16:02:03.4664160Z Reading in output_dir/freebase_mtr100_mte100-valid.txt   2/3
    2021-04-21T16:02:03.4665790Z Reading in output_dir/freebase_mtr100_mte100-test.txt   3/3
    2021-04-21T16:02:03.4666760Z Number of instance per file:[483142, 50000, 59071]
    2021-04-21T16:02:03.4667560Z Number of nodes: 14951
    2021-04-21T16:02:03.4668370Z Number of edges: 592213
    2021-04-21T16:02:03.4669180Z Number of relations: 1345
    2021-04-21T16:02:03.4670000Z Delimiter: ~	~
    2021-04-21T16:02:05.0357020Z ============================= test session starts ==============================
    2021-04-21T16:02:05.0358980Z platform darwin -- Python 3.9.4, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
    2021-04-21T16:02:05.0360090Z rootdir: /Users/runner/work/marius/marius
    2021-04-21T16:02:05.0360930Z collected 29 items
    2021-04-21T16:02:05.0361460Z 
    2021-04-21T16:04:46.3756720Z test/python/bindings/test_fb15k.py .                                     [  3%]
    2021-04-21T16:04:46.4321450Z test/python/preprocessing/test_config_generator_cmd_opt_parsing.py ..... [ 20%]
    2021-04-21T16:04:47.7820700Z .........                                                                [ 51%]
    2021-04-21T16:04:47.8108760Z test/python/preprocessing/test_csv_preprocessor.py .                     [ 55%]
    2021-04-21T16:04:59.0886020Z test/python/preprocessing/test_preprocess_cmd_opt_parsing.py ........... [ 93%]
    2021-04-21T16:04:59.1086690Z ..                                                                       [100%]
    2021-04-21T16:04:59.1171760Z 
    2021-04-21T16:04:59.1204890Z ======================== 29 passed in 175.06s (0:02:55) ========================
    2021-04-21T16:04:59.2552200Z Fatal Python error: PyEval_SaveThread: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
    2021-04-21T16:04:59.2652700Z Python runtime state: finalizing (tstate=0x7fe41c409b50)
    2021-04-21T16:04:59.2754080Z 
    2021-04-21T16:04:59.2856250Z /Users/runner/work/_temp/511be060-bb2e-418a-ac5e-2e0f5d09f4d7.sh: line 4:  5232 Abort trap: 6           pytest test
    

    To Reproduce Run the macOS pip install test workflow

    Expected behavior The pip install works fine on linux:

    2021-04-21T15:50:21.1538556Z python3 -c "import marius as m"
    2021-04-21T15:50:21.1539213Z python3 -c "from marius.tools import preprocess"
    2021-04-21T15:50:21.1539916Z marius_preprocess fb15k output_dir/
    2021-04-21T15:50:21.1540448Z pytest test
    2021-04-21T15:50:21.1584496Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
    2021-04-21T15:50:21.1585040Z env:
    2021-04-21T15:50:21.1585484Z   BUILD_TYPE: Release
    2021-04-21T15:50:21.1586287Z ##[endgroup]
    2021-04-21T15:50:26.6729316Z fb15k
    2021-04-21T15:50:26.6730578Z Downloading fb15k.tgz to output_dir/fb15k.tgz
    2021-04-21T15:50:26.6731334Z Extracting
    2021-04-21T15:50:26.6731982Z Extraction completed
    2021-04-21T15:50:26.6732836Z Detected delimiter: 	
    2021-04-21T15:50:26.6734284Z Reading in output_dir/freebase_mtr100_mte100-train.txt   1/3
    2021-04-21T15:50:26.6735973Z Reading in output_dir/freebase_mtr100_mte100-valid.txt   2/3
    2021-04-21T15:50:26.6738109Z Reading in output_dir/freebase_mtr100_mte100-test.txt   3/3
    2021-04-21T15:50:26.6739043Z Number of instance per file:[483142, 50000, 59071]
    2021-04-21T15:50:26.6739918Z Number of nodes: 14951
    2021-04-21T15:50:26.6740497Z Number of edges: 592213
    2021-04-21T15:50:26.6741087Z Number of relations: 1345
    2021-04-21T15:50:26.6741661Z Delimiter: ~	~
    2021-04-21T15:50:27.8808863Z ============================= test session starts ==============================
    2021-04-21T15:50:27.8811125Z platform linux -- Python 3.8.5, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
    2021-04-21T15:50:27.8812170Z rootdir: /home/runner/work/marius/marius
    2021-04-21T15:50:27.8812954Z collected 29 items
    2021-04-21T15:50:27.8813617Z 
    2021-04-21T15:50:50.9462537Z test/python/bindings/test_fb15k.py .                                     [  3%]
    2021-04-21T15:50:50.9827642Z test/python/preprocessing/test_config_generator_cmd_opt_parsing.py ..... [ 20%]
    2021-04-21T15:50:51.6762691Z .........                                                                [ 51%]
    2021-04-21T15:50:51.6988451Z test/python/preprocessing/test_csv_preprocessor.py .                     [ 55%]
    2021-04-21T15:50:57.6109717Z test/python/preprocessing/test_preprocess_cmd_opt_parsing.py ........... [ 93%]
    2021-04-21T15:50:57.6234674Z ..                                                                       [100%]
    2021-04-21T15:50:57.6235430Z 
    2021-04-21T15:50:57.6236116Z ============================= 29 passed in 30.61s ==============================
    

    Environment MacOS: platform darwin -- Python 3.9.4, pytest-6.2.3, py-1.10.0, pluggy-0.13.1 Linux: platform linux -- Python 3.8.5, pytest-6.2.3, py-1.10.0, pluggy-0.13.1

    Additional context test/python/bindings/test_fb15k.py is the likely culprit since it's the only test which runs the bindings. It's unclear why the test is still marked as passed.

    bug 
    opened by JasonMoho 1
  • Scope out additional decoder models

    Our current functionality is limited. We only support DistMult, ComplEx, and TransE, with double-sided relation embeddings.

    We should expand our functionality by adding more models to Marius. The first thing to do is to scope out which models are out there and which can be implemented easily in our current abstractions.

    A starting point is to look into the models supported by PyKeen:

    • List: https://github.com/pykeen/pykeen#models-26
    • Implementation: https://github.com/pykeen/pykeen/blob/master/src/pykeen/nn/functional.py
    • Documentation: https://pykeen.readthedocs.io/en/stable/api/pykeen.nn.functional.convkb_interaction.html

    Once we get a better handle on which models are out there, we can see in what ways our current abstractions are lacking and how we can improve them.

    For each decoder model we should ask the following questions:

    • Does the model need additional information beyond a source, relation and destination embedding for a given edge?
    • Does the model require computation that does not fit with our double-sided relation embedding approach? E.g. score_lhs = comparator(relation_operator(h, r), t) and score_rhs = comparator(relation_operator(t, r'), h) (see the sketch after this list)
    • What modifications would we need to make to LinkPredictionDecoder to support this model?
    • Is there global state we need to maintain?
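
    As a concrete reference point, here is how DistMult fits the current two-step abstraction (hypothetical function names; shapes are batch x dim):

        import torch

        def relation_operator(node_embs, rel_embs):   # DistMult: elementwise product
            return node_embs * rel_embs

        def comparator(lhs_embs, rhs_embs):           # dot-product score
            return (lhs_embs * rhs_embs).sum(-1)

        h, r, t = (torch.randn(5, 100) for _ in range(3))
        r_inv = torch.randn(5, 100)                   # second side of the double-sided relation
        score_lhs = comparator(relation_operator(h, r), t)
        score_rhs = comparator(relation_operator(t, r_inv), h)

    A model like ConvKB, which scores the (h, r, t) triple jointly, would not decompose into this two-step form.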
    enhancement 
    opened by JasonMoho 0
  • error for final make step

    Describe the bug

    I am trying to build marius on a google cloud vm. The build went smoothly until the final step at which point I got an error at the linking step:

    [100%] Built target marius
    Scanning dependencies of target marius_train
    [100%] Building CXX object CMakeFiles/marius_train.dir/src/marius.cpp.o
    [100%] Linking CXX executable marius_train
    /usr/bin/ld: libmarius.so: undefined reference to `std::filesystem::copy_file(std::filesystem::path const&, std::filesystem::path const&, std::filesystem::copy_options)'
    /usr/bin/ld: libmarius.so: undefined reference to `std::filesystem::path::_M_split_cmpts()'
    /usr/bin/ld: libmarius.so: undefined reference to `std::filesystem::status(std::filesystem::path const&)'
    /usr/bin/ld: libmarius.so: undefined reference to `std::filesystem::rename(std::filesystem::path const&, std::filesystem::path const&)'
    collect2: error: ld returned 1 exit status
    make[3]: *** [CMakeFiles/marius_train.dir/build.make:100: marius_train] Error 1
    make[2]: *** [CMakeFiles/Makefile2:113: CMakeFiles/marius_train.dir/all] Error 2
    make[1]: *** [CMakeFiles/Makefile2:125: CMakeFiles/marius_train.dir/rule] Error 2
    make: *** [Makefile:188: marius_train] Error 2

    To Reproduce Steps to reproduce the behavior:

    Follow the installation instruction from github README:

    git clone https://github.com/marius-team/marius.git
    cd marius
    python3 -m pip install -r requirements.txt
    mkdir build
    cd build
    cmake ../ -DUSE_CUDA=1
    make marius_train -j

    Expected behavior The final step of the build should create marius executables, presumably in the build/bin directory.

    Environment This build was attempted on a google cloud vm with these parameters:

    boot disk = tensorflow-2-4-20210414-140511-boot
    Environment version = M65

    I can provide more environment details after sshing into the instance, but not sure what is relevant for the above.

    bug 
    opened by realmarcin 1
  • Remove dependency on Boost

    Is your feature request related to a problem? Please describe. We currently use Boost for command line argument parsing and parsing .ini configuration files.

    Boost is very heavyweight and complicates the build process. Additionally, the download links to the boost library may fail: See #16.

    Describe the solution you'd like We should remove the dependency on Boost by switching to a lightweight library which can parse .ini files and command line arguments with the same semantics.

    The implementation with the new library should match functionality with the current implementation in Boost.
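
    For a sense of the required semantics, Python's standard configparser handles the same .ini shape; the eventual replacement would be a C++ library, so this is only to pin down the expected behavior:

        import configparser

        sample = "[model]\nembedding_size=400\ndecoder_model=ComplEx\n"
        cp = configparser.ConfigParser()
        cp.read_string(sample)
        print(cp["model"].getint("embedding_size"))  # -> 400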

    Modifications will be largely contained to src/config.cpp.

    One minor dependency on Boost's lockfree queues can be removed in src/buffer.cpp and replaced with a traditional lock + queue data structure.

    Describe alternatives you've considered We can implement our own parsing functionality if no libraries fit our requirements.

    Additional context We might not be able to find a library which does both config parsing and the command line parsing. If we cannot, we should pick one which can do the config parsing and then implement our own command line parser.

    enhancement 
    opened by JasonMoho 3
  • Populate Documentation

    What is the documentation lacking? Please describe. The documentation is only populated for describing the configuration files. The rest of the documentation needs to be filled out.

    Describe the improvement you'd like

    Add documentation for:

    • Training embeddings on graphs of varying scale
    • Evaluating embeddings
    • Configuration file usage
    • Python API description and usage
    • Custom Model definition
    • Preprocessing and input file formats
    • Postprocessing, output file formats and downstream inference
    • How Marius works
    • Edge bucket orderings and the partition buffer
    • Pipelining and asynchronous training
    • Structure of the codebase
    • Development information and workflows

    Additional context The documentation should also be built and hosted automatically on the marius-project.org website. This can be put in a separate pull request.

    documentation 
    opened by JasonMoho 0
  • Complete and test custom model support with the python bindings

    Is your feature request related to a problem? Please describe. The Python bindings need additional implementation to support custom models.

    Describe the solution you'd like We should be able to support defining a custom model in the python API by doing the following:

    import pymarius as m

    class customRelationOperator(m.RelationOperator):
        def forward(self, node_embs, rel_embs):
            return node_embs + rel_embs

    class customComparator(m.Comparator):
        def forward(self, src_embs, dst_embs):
            return src_embs * dst_embs

    class CustomModel(m.Model):
        def __init__(self):
            self.decoder = m.LinkPredictionDecoder(customComparator(), customRelationOperator())
    

    We may need to make modifications to the C++ to support these semantics.

    Tests should be written for the custom models here: https://github.com/marius-team/marius/tree/main/test/python

    We should test:

    • The relation operator and comparator forward methods compute the proper values
    • The model.forward() method matches the expected computation
    • The model is used when run in a training loop. The tests can record whether a function has been called (see the sketch after this list), so they can ensure that during a single training loop the model.forward() function was called for every batch, and the same for the relation operator and the comparator.
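
    One possible shape for the call-count checks, using unittest.mock under pytest (all names here are stand-ins, not the real API):

        from unittest import mock

        class TinyModel:                           # stand-in for a custom model
            def forward(self, batch):
                return sum(batch)

        def run_one_epoch(model, batches):         # stand-in for the training loop
            for b in batches:
                model.forward(b)

        def test_forward_called_per_batch():
            model, batches = TinyModel(), [[1, 2], [3, 4], [5]]
            with mock.patch.object(model, "forward", wraps=model.forward) as spy:
                run_one_epoch(model, batches)
                assert spy.call_count == len(batches)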

    Describe alternatives you've considered Alternative designs for custom models might require large changes to the core c++ code.

    Additional context For the rest of the bindings, we will add their tests in a future pull request.

    enhancement 
    opened by JasonMoho 0
  • Tests and validators for csv_converter.py

    Is your feature request related to a problem? Please describe. The converter for delimited files does not have a set of tests associated with it.

    Describe the solution you'd like We should add tests for each function in csv_converter.py which cover reasonable inputs and possible failure modes.

    For example, for the general_parser function https://github.com/marius-team/marius/blob/main/tools/csv_converter.py#L118 we should test:

    • Different numbers of files
    • Invalid input files
    • Delimiters
    • Dataset splits
    • Number of partitions
    • Other input arguments (format, dtype, remap_ids, start_col, and num_line_skip)

    Part of this testing effort should be to add validators for the input arguments to general_parser to ensure no unreasonable values are passed into it: e.g. a dataset split of (.8, .8), or a format of "sxrd", etc.
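
    For instance, validators along these lines could be added (hypothetical helpers; the split semantics and the set of format characters are assumptions based on the examples above):

        def validate_splits(splits):
            # Assume splits are fractions of the data that must sum to 1.
            if any(not 0 <= s <= 1 for s in splits) or abs(sum(splits) - 1.0) > 1e-9:
                raise ValueError(f"dataset split must be fractions summing to 1, got {splits}")

        def validate_format(fmt, allowed="srd"):
            # Assume the format is a string of unique column codes, e.g. 's', 'r', 'd'.
            if len(set(fmt)) != len(fmt) or any(c not in allowed for c in fmt):
                raise ValueError(f"format must use unique characters from {allowed!r}, got {fmt!r}")

        validate_splits((0.8, 0.1, 0.1))   # passes
        validate_format("sxrd")            # raises ValueError: 'x' is not a valid column code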

    Describe alternatives you've considered The alternative is to leave it untested. No thanks.

    Additional context While testing we should note the ways we can improve and simplify the design of the preprocessing code and create a list of changes we will want to make in a future pull request. For example, https://github.com/marius-team/marius/blob/main/tools/csv_converter.py#L213, https://github.com/marius-team/marius/blob/main/tools/csv_converter.py#L238, and https://github.com/marius-team/marius/blob/main/tools/csv_converter.py#L252 should be put into a function and called.

    enhancement testing 
    opened by JasonMoho 1