A library for distributed ML training with PyTorch

Overview

moolib - a communications library for distributed ML training

moolib offers general-purpose RPC with automatic transport selection (shared memory, TCP/IP, InfiniBand), allowing models to data-parallelize their training and synchronize gradients and model weights across many nodes.

moolib is an RPC library to help you perform distributed machine learning research, particularly reinforcement learning. It is designed to be highly flexible and highly performant.

It is flexible because it allows researchers to define their own training loops and data-collection policies with minimal interference or abstractions - moolib gets out of the way of research code.

It is performant because it gives researchers the power of efficient data-parallelization across GPUs with minimal overhead, in a manner that is highly scalable.

moolib aims to provide researchers with the freedom to implement whatever experiment loop they desire, and the freedom to scale it up from single GPUs to hundreds at will (with no additional code). It ships with a reference implementation of IMPALA on Atari that can easily be adapted to other environments or algorithms.
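As a taste of the API, here is a minimal sketch assembled from the issue reports further down (the exact surface may differ between versions): one peer exposes a queue-backed endpoint, and another peer calls it by name.

import asyncio
import moolib

async def main():
    addr = "127.0.0.1:4411"

    # Server peer: expose an "echo" endpoint backed by a queue.
    server = moolib.Rpc()
    server.set_name("server")
    server.listen(addr)
    queue = server.define_queue("echo")

    async def serve():
        # Each queue item is (return_callback, args, kwargs).
        ret_cb, args, kwargs = await queue
        ret_cb(args[0])  # reply with the first argument unchanged

    asyncio.get_running_loop().create_task(serve())

    # Client peer: connect and call the endpoint by peer name.
    client = moolib.Rpc()
    client.set_name("client")
    client.connect(addr)
    print(await client.async_("server", "echo", "hello"))

asyncio.run(main())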

Installing

To compile moolib without CUDA support, set this before building:

export USE_CUDA=0

To install from GitHub:

pip install git+https://github.com/facebookresearch/moolib

To build from source:

git clone --recursive git@github.com:facebookresearch/moolib
cd moolib && pip install .
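To build from source without CUDA support, the flag above can be combined with the install step, e.g.:

USE_CUDA=0 pip install .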

How to host docs (after installation):

pip install sphinx==4.1.2
cd docs && ./run_docs.sh

Run an Example

To run the example agent on a given Atari level:

First, start the broker:

python -m moolib.broker

It will output something like Broker listening at 0.0.0.0:4431.

Note that a single broker is enough for all your experiments.

Now take the IP address of your computer. If you ssh'd into your machine, this should work (in a new shell):

export BROKER_IP=$(echo $SSH_CONNECTION | cut -d' ' -f3)  # Should give your machine's IP.
export BROKER_PORT=4431

To start an experiment with a single peer:

python -m examples.vtrace.experiment connect=$BROKER_IP:$BROKER_PORT \
    savedir=/tmp/moolib-atari/savedir \
    project=moolib-atari \
    group=Zaxxon-Breakout \
    env.name=ALE/Breakout-v5

To add more peers to this experiment, start more processes with the same project and group settings, using a different setting for device (default: 'cuda:0').
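For example, a second peer on the same machine might be started with the same command plus an explicit device setting (assuming a second GPU is available):

python -m examples.vtrace.experiment connect=$BROKER_IP:$BROKER_PORT \
    savedir=/tmp/moolib-atari/savedir \
    project=moolib-atari \
    group=Zaxxon-Breakout \
    env.name=ALE/Breakout-v5 \
    device='cuda:1'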

Documentation

See moolib's API documentation.

Benchmarks

[Figures atari_1 and atari_2: results on Atari]

License

moolib is licensed under the MIT License. See LICENSE for details.

Issues
  • More doc, tutorial, and whitepaper

    Thank you so much for open-sourcing this library. It looks very useful and interesting. Do you have plans to release more docs, tutorials, and/or a whitepaper that explains the design choices? How does moolib compare to alternative frameworks like Ray or PyTorch RPC, in terms of performance benchmarking and API design?

    Many thanks for your help!

    opened by LinxiFan 4
  • Add Dockerfile.

    This successfully compiles moolib and runs the a2c.py toy example.

    Perhaps we could add some kind of automation to this? Upload to dockerhub? @condnsdmatters, wdyt?

    CLA Signed 
    opened by heiner 2
  • fix silly numpy import/fork bug

    This is pretty silly, but during serialization we check if the input is a numpy array (despite #17). For this check, pybind needs to import numpy. During import, my local version of numpy calls fork. From googling, this is seen as a no-no, but it does it regardless. I've not investigated why it does the fork. Forking during serialization breaks moolib with the usual don't-fork fatal error message (and setting the MOOLIB_ALLOW_FORK env var results in a deadlock). There is no issue if the user already imported numpy (or pytorch, as that also pulls in numpy). However, if they didn't, even the most basic usage of moolib results in a fatal error.

    This fix calls the pybind code that imports numpy during moolib module initialization, so numpy is already imported by the time serialization runs.
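    Until the fix is released, the report above implies a user-side workaround: import numpy (or torch) yourself before using moolib, so the import-time fork happens outside serialization. A sketch:

    import numpy  # imported up front, so serialization never triggers a fresh numpy import
    import moolib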

    CLA Signed 
    opened by tscmoo 1
  • Serialization issue for python class with multiple member variables when running in parallel.

    It seems the fix for https://github.com/facebookresearch/moolib/issues/14 introduced some new issues. When transmitting Python classes with multiple member variables while multiple clients run in parallel, we see errors like the one below. The issue seems to come from the newly added readinto function. I created a reproduction script, listed below. We have to keep working on this.

    RuntimeError: Remote exception during RPC call (print_data): BufferError: buffer size is 42, but readinto requested 161372 bytes
    
    import asyncio
    import time
    
    from typing import Any, Callable, NoReturn, Optional
    
    import torch
    import torch.multiprocessing as mp
    
    import moolib
    
    addr = "127.0.0.1:4411"
    timeout = 60
    
    
    class DataWrapper:
        def __init__(self,
                     x: Optional[Any] = None,
                     y: Optional[Any] = None) -> None:
            self.x = x
            self.y = y
    
        def __str__(self) -> str:
            return f"[{self.x}]"
    
        def __repr__(self) -> str:
            return self.__str__()
    
    
    def print_data(x: Any) -> None:
        if isinstance(x, tuple):
            print(f"[print_data] batch_size = {len(x)}")
        else:
            print(f"[print_data] data_size = {x.x.size()}")
        time.sleep(0.01)
    
    
    def handle_task_exception(task: asyncio.Task) -> None:
        try:
            task.result()
        except asyncio.CancelledError:
            pass
        except Exception as e:
            raise e
    
    
    async def process(que: moolib.Queue, callback: Callable[..., Any]) -> NoReturn:
        try:
            while True:
                ret_cb, args, kwargs = await que
                if args and kwargs:
                    ret = callback(*args, **kwargs)
                elif args:
                    ret = callback(*args)
                elif kwargs:
                    ret = callback(**kwargs)
                else:
                    ret = callback()
                ret_cb(ret)
        except asyncio.CancelledError:
            print("[Server] process cancelled")
            pass
        except Exception as e:
            print(e)
            raise
    
    
    def server_func() -> NoReturn:
        server = moolib.Rpc()
        server.set_name("server")
        server.set_timeout(timeout)
        server.listen(addr)
    
        loop = asyncio.get_event_loop()
        task = loop.create_task(
            process(server.define_queue("print_data"), print_data))
        # task = loop.create_task(
        #     process(
        #         server.define_queue("print_data",
        #                             batch_size=100,
        #                             dynamic_batching=True), print_data))
        task.add_done_callback(handle_task_exception)
        loop.run_until_complete(task)
    
    
    def client_func(index: int) -> None:
        client = moolib.Rpc()
        client.set_name(f"client-{index}")
        client.set_timeout(timeout)
        client.connect(addr)
    
        x = torch.randn(496 + index, 10)
        y = torch.randn(422 + index, 95)
    
        num = 10
        x_wrapped = DataWrapper(x, y)
        for _ in range(num):
            client.sync("server", "print_data", x_wrapped)
            time.sleep(0.01)
    
    
    def main() -> None:
        server_proc = mp.Process(target=server_func)
        server_proc.start()
    
        num_clients = 4
        client_processes = []
        for i in range(num_clients):
            client_proc = mp.Process(target=client_func, args=(i, ))
            client_proc.start()
            client_processes.append(client_proc)
    
        for client_proc in client_processes:
            client_proc.join()
    
        server_proc.terminate()
    
    
    if __name__ == "__main__":
        mp.set_start_method("spawn")
        main()
    
    opened by xiaomengy 1
  • Transmission failed for python classes with specific shape torch.Tensor members.

    When moolib transmits a class instance that contains a torch.Tensor member with a specific shape, it may fail with the error below, while transmitting the tensor directly causes no problem. This issue may block GNN use cases, because graphs are usually library-defined classes with such tensors. A reproduction script (run on devfair) is listed below. We have to fix this issue.

    RuntimeError: Remote exception during RPC call (print_data): ValueError: read() returned non-bytes object (<class 'memoryview'>)
    
    import asyncio
    import traceback
    
    from typing import Any
    
    import torch
    import moolib
    
    
    class DataWrapper:
    
        def __init__(self, data: Any) -> None:
            self.data = data
    
        def __str__(self) -> str:
            return f"[{self.data}]"
    
        def __repr__(self) -> str:
            return self.__str__()
    
    
    async def process(que, callback):
        try:
            while True:
                ret_cb, args, kwargs = await que
                if args and kwargs:
                    ret = callback(*args, **kwargs)
                elif args:
                    ret = callback(*args)
                elif kwargs:
                    ret = callback(**kwargs)
                else:
                    ret = callback()
                ret_cb(ret)
        except asyncio.CancelledError:
            print("[Server] process cancelled")
            pass
        except Exception as e:
            print(e)
            raise
    
    
    async def main():
        addr = "127.0.0.1:4411"
        timeout = 60
    
        loop = asyncio.get_running_loop()
    
        server = moolib.Rpc()
        server.set_name("server")
        server.set_timeout(timeout)
    
        def print_data(x: Any) -> None:
            print(f"[print_data] graph = {x}")
            return x
    
        loop.create_task(process(server.define_queue("print_data"), print_data))
    
        server.listen(addr)
    
        client = moolib.Rpc()
        client.set_name("client")
        client.set_timeout(timeout)
        client.connect(addr)
    
        x = torch.randn(422, 95)
        x_wrapped = DataWrapper(x)
    
        num = 20
        futs = []
        for _ in range(num):
            # fut = client.async_("server", "print_data", x)  # This works
            fut = client.async_("server", "print_data", x_wrapped)
            futs.append(fut)
        for fut in futs:
            await fut
    
    
    if __name__ == "__main__":
        try:
            asyncio.run(main())
        except:
            traceback.print_exc()
    
    
    opened by xiaomengy 1
  • Fix deadlock #6

    When requests and responses (and thus Buffers) are acked and freed, pytorch tensors can be freed. This can trigger code that frees up Python objects, takes the GIL, etc. We free this data while holding some locks, and perhaps even while holding the GIL, so deadlocks can occur.

    This fix moves Buffers from the requests/responses into an empty Deferred function call, before freeing the structures. This way, the actual Buffers will be freed after the deferred function is called, which is very soon after, but outside of any locks. This should be safe regardless of what pytorch does. We do the same thing with user callbacks, since tensors and anything else could live in those.
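    The pattern generalizes beyond C++: never let the last reference to an object die while a lock is held; move it out first and drop it after the lock is released. A Python sketch of the idea (hypothetical names, not moolib's actual code):

    import threading

    _lock = threading.Lock()
    _requests = {}  # request_id -> buffer that may hold pytorch tensors

    def handle_ack(request_id):
        # Move the buffer out of the shared structure under the lock...
        with _lock:
            buffer = _requests.pop(request_id)
        # ...and drop the reference out here, where destructor code that
        # takes the GIL or other locks can no longer deadlock against us.
        del buffer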

    CLA Signed 
    opened by tscmoo 1
  • Deadlock on machines with limited computation resources

    It seems moolib can reach a deadlock when releasing tensors in handleAck. The weird thing is that this almost never happened on machines with enough CPU cores, such as our FAIR cluster, so we didn't notice the issue. However, if we run the code below on a desktop machine, it will almost always hang. The deadlock happens in the dtor of torch::Tensor in moolib::Any. It may also be related to the GIL.

    We have to fix this ASAP; otherwise most users cannot use moolib successfully.

    import asyncio
    import time
    import traceback
    
    import torch
    import moolib
    
    
    async def process(que, callback):
        try:
            while True:
                ret_cb, args, kwargs = await que
                if args and kwargs:
                    ret = callback(*args, **kwargs)
                elif args:
                    ret = callback(*args)
                elif kwargs:
                    ret = callback(**kwargs)
                else:
                    ret = callback()
                ret_cb(ret)
        except asyncio.CancelledError:
            print("[Server] process cancelled")
            pass
        except Exception as e:
            print(e)
            raise
    
    
    async def main():
        addr = "127.0.0.1:4411"
        timeout = 60
    
        num_tests = 100
        dim = 128
        linear = torch.nn.Linear(dim, dim)
    
        server = moolib.Rpc()
        server.set_name("server")
        server.set_timeout(timeout)
        server.listen(addr)
    
        loop = asyncio.get_running_loop()
    
        def run_linear(x, idx):
            print(f"[Linear {idx}] x_size = {x.size()}")
            return linear(x)
    
        loop.create_task(process(server.define_queue("linear"), run_linear))
    
        client = moolib.Rpc()
        client.set_name("client")
        client.set_timeout(timeout)
        client.connect(addr)
    
        x = torch.randn(num_tests, dim)
        y = linear(x)
    
        x_list = torch.unbind(x)
        y_list = torch.unbind(y)
    
        futs = []
        for i, x in enumerate(x_list):
            futs.append(client.async_("server", "linear", x, i))
        for i, fut in enumerate(futs):
            y = await fut
            assert torch.allclose(y, y_list[i], rtol=1e-5, atol=1e-6)
    
    
    if __name__ == "__main__":
        try:
            asyncio.run(main())
        except:
            traceback.print_exc()
    
    opened by xiaomengy 1
  • update allocators

    This updates both the Function and Buffer allocators. Previously, both cached allocations in a purely thread-local manner, which may not always improve performance and can even effectively leak memory under some usage patterns (when memory is allocated in one thread and freed in another). This change introduces a global "backup" area: memory is moved there when the thread-local area reaches a certain size, and allocated from there when the thread-local area is empty.
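    A toy Python sketch of that two-level scheme (hypothetical names; the real allocators are C++ and size-class aware):

    import threading

    _tls = threading.local()         # per-thread free list
    _backup = []                     # global "backup" area
    _backup_lock = threading.Lock()
    LOCAL_LIMIT = 64                 # max blocks cached per thread

    def alloc(size):
        cache = getattr(_tls, "cache", None)
        if cache:
            return cache.pop()       # fast path: no locking
        with _backup_lock:
            if _backup:
                return _backup.pop() # refill from the global area
        return bytearray(size)       # fresh allocation (toy: ignores sizes)

    def free(block):
        cache = getattr(_tls, "cache", None)
        if cache is None:
            cache = _tls.cache = []
        if len(cache) < LOCAL_LIMIT:
            cache.append(block)      # keep it thread-local
        else:
            with _backup_lock:
                _backup.append(block)  # overflow to the global area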

    CLA Signed 
    opened by tscmoo 0
  • update async

    Small improvement to async, the asynchronous function dispatcher, which runs a given function in a thread from a thread pool. It is used in various places in moolib to run things asynchronously.

    This is just a performance improvement, achieved mostly by doing some spinning instead of always waiting on a semaphore, and by decreasing cache line contention.
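    In sketch form, the spinning idea looks like this (hypothetical Python; the real dispatcher is C++): poll for work a bounded number of times before blocking on the semaphore.

    import threading
    from collections import deque

    sem = threading.Semaphore(0)
    tasks = deque()                  # deque append/popleft are thread-safe
    SPIN = 1000                      # bounded spin before sleeping

    def submit(fn):
        tasks.append(fn)
        sem.release()                # one permit per queued task

    def worker():
        while True:
            # Spin briefly on non-blocking acquires to avoid the
            # semaphore's sleep/wake cost when work arrives quickly.
            for _ in range(SPIN):
                if sem.acquire(blocking=False):
                    break
            else:
                sem.acquire()        # nothing arrived: block until it does
            tasks.popleft()()        # run the task paired with our permit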

    I've not done any thorough performance benchmarks.

    CLA Signed 
    opened by tscmoo 0
  • Fix #19 pickle concurrency bug

    Fix for issue #19.

    All pickling used one global (static) Python object (File) for serialization, presumably under the assumption that the GIL would protect it. It doesn't. This fix moves the Python object onto the stack, so that each invocation of the serialization code uses its own object and multiple objects can be serialized concurrently.
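    The bug pattern is generic enough to sketch in Python (hypothetical names, not moolib's code): a single shared buffer "protected" by the GIL is unsafe, because the GIL can be released between operations that must stay paired.

    import io
    import pickle

    # Buggy shape: one module-global buffer shared by all callers.
    _shared = io.BytesIO()

    def serialize_buggy(obj):
        _shared.seek(0)
        _shared.truncate()
        pickle.dump(obj, _shared)    # concurrent callers interleave writes
        return _shared.getvalue()

    # Fixed shape: a fresh object per invocation, nothing shared.
    def serialize_fixed(obj):
        buf = io.BytesIO()           # effectively "on the stack"
        pickle.dump(obj, buf)
        return buf.getvalue()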

    CLA Signed 
    opened by tscmoo 0
  • pickle: implement readinto to fix #14

    Pickle has a special code path for "larger" data, and it doesn't want to accept the memoryview object that moolib returns from read. However, it prefers to use the readinto function anyway, so I implemented that, and it now works. The implementation is still tied to the pickle implementation, which is perhaps not ideal, but it works for now.
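    For context, readinto belongs to the standard file-like API pickle consumes: when it is present, pickle fills a caller-provided buffer directly instead of going through read. A minimal sketch of such an object (hypothetical, not moolib's implementation):

    import pickle

    class BufferReader:
        # File-like view over a bytes payload, exposing readinto.
        def __init__(self, payload):
            self._view = memoryview(payload)
            self._pos = 0

        def read(self, n=-1):
            if n < 0:
                n = len(self._view) - self._pos
            chunk = bytes(self._view[self._pos:self._pos + n])
            self._pos += len(chunk)
            return chunk             # must be bytes, not a memoryview

        def readline(self):
            nl = bytes(self._view[self._pos:]).find(b"\n")
            end = len(self._view) if nl < 0 else self._pos + nl + 1
            return self.read(end - self._pos)

        def readinto(self, b):
            n = min(len(b), len(self._view) - self._pos)
            b[:n] = self._view[self._pos:self._pos + n]
            self._pos += n
            return n

    data = pickle.dumps(list(range(10)))
    assert pickle.load(BufferReader(data)) == list(range(10))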

    CLA Signed 
    opened by tscmoo 0
  • broker crashing, rv < 0: network is unreachable

    I'm running the IMPALA vtrace example, on one machine with a single peer. Using export BROKER_IP=$(echo $SSH_CONNECTION | cut -d' ' -f3), the moolib.broker crashes at some point in the run (and the peer job also crashes):

    $ python -m moolib.broker
    Broker listening at 0.0.0.0:4431
    terminate called after throwing an instance of 'std::runtime_error'
      what():  In connectFromLoop at .../moolib/src/tensorpipe/tensorpipe/transport/uv/uv.h:313 "rv < 0: network is unreachable"
    Aborted (core dumped)
    

    However, I have not run into this issue yet when starting the peer using BROKER_IP=0.0.0.0.

    opened by etaoxing 3
  • Tests broken

    Seems like there's some regression resulting in

                obs_future = envs.step(0, action_t)
    >           obs = obs_future.result()
    E           RuntimeError: Timed out waiting for env
    
    examples/a2c.py:216: RuntimeError
    =========================== short test summary info ============================
    FAILED test/integration/test_a2c.py::TestA2CExample::test_single_node_training
    !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
    

    See 1 and 2.

    opened by heiner 2
  • Atari median human-normalized score

    Hi devs,

    thanks a lot for the great library. It's been observed that moolib improves quite significantly over torchbeast. Great!

    May I ask whether you have, or could generate, the aggregated median human-normalized score curve over all the tested games? I'm wondering how it compares to the original IMPALA paper. I'm also curious what drives the improvements; do you have intuitions about it? moolib improves the distributed communication, but I don't think that alone directly explains dramatic reward improvements. It would be great if you could help me understand this better. Thanks in advance!

    opened by zhongwen 3
  • packet contained unknown content: -1

    I am trying to run parallel worker processes that send dictionaries of tensors to an inference process, where batched neural network forwarding happens. I get the error E0209 20:15:45.299231 2754780 /tmp/pip-req-build-fpmouidj/src/tensorpipe/tensorpipe/core/listener_impl.cc:333] packet contained unknown content: -1. The program runs fine and terminates fine, and I believe the results are correct too. Here is the minimal code to reproduce the error.

    import asyncio
    import multiprocessing as mp
    import moolib
    import torch
    
    
    local_addr = "127.0.0.1:4412"
    
    async def process(queue, callback):
        i = 0
        try:
            while True:
                # print("waiting for batch", i)
                i += 1
                return_callback, args, kwargs = await queue
                if args and kwargs:
                    retval = callback(*args, **kwargs)
                elif args:
                    retval = callback(*args)
                elif kwargs:
                    retval = callback(**kwargs)
                else:
                    retval = callback()
                return_callback(retval)
        except asyncio.CancelledError:
            print("process cancelled")
            pass
        except Exception as e:
            print(e)
            raise
    
    
    linear_layer = torch.nn.Linear(1000, 1000)
    
    
    def run_linear_host(x):
        a = linear_layer(x["a"])
        b = linear_layer(x["b"])
        return a + b
    
    
    async def host_func(barrier):
        host = moolib.Rpc()
        host.set_name("host")
        host.listen(local_addr)
        queue = host.define_queue("linear", batch_size=1000, dynamic_batching=True)
    
        barrier.wait()
        print("host process passed the barrier")
        await process(queue, run_linear_host)
    
    
    async def client_func(barrier, pid):
        client = moolib.Rpc()
        barrier.wait()
        client.connect(local_addr)
        print(f"process {pid} connected")
    
        ys = []
        for i in range(100):
            x  = {
                "a": torch.rand(1000),
                "b": torch.rand(1000),
            }
            y = await client.async_("host", "linear", x)
            ys.append(y)
        print("done")
    
    
    
    if __name__ == "__main__":
        num_thread = 10
        barrier = mp.Barrier(num_thread + 1)
    
        host_p = mp.Process(target=lambda: asyncio.run(host_func(barrier)))
        host_p.start()
    
        processes = []
        for i in range(num_thread):
            p = mp.Process(target=lambda: asyncio.run(client_func(barrier, i)))
            p.start()
            processes.append(p)
    
        for p in processes:
            p.join()
    
        host_p.terminate()
    

    The error disappears if I run it with a smaller num_thread=1,2,3 and appears more frequently with larger num_thread.

    I installed the latest main branch with pip install git+https://github.com/facebookresearch/moolib.

    opened by hengyuan-hu 3
  • RPC update; update TensorPipe, enable infiniband, various rpc-related updates & fixes

    This brings tensorpipe up to date with the latest version. InfiniBand is now enabled by default, and all of the code for handling CUDA tensors is present in the RPC, but CUDA is still disabled by default, as CUDA tensors are not yet supported in all-reduce, and a bit more testing should be done.

    It's a fair bit of code, but among the fixes/changes:

    • Batcher broke in pytorch 1.10, so that's fixed.
    • Properly clean up defined functions and their resources on shutdown.
    • Better(?) thread scheduling in some places; some deadlocks were also fixed or avoided this way.
    • Some performance improvements, although there are also some regressions. The best performance in the general case still seems to be achieved with moolib.set_max_threads(1) (see the one-liner below), and this warrants further investigation.
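    The knob mentioned in the last point is a one-liner in user code:

    import moolib

    moolib.set_max_threads(1)  # often the fastest general setting, per the note above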
    CLA Signed 
    opened by tscmoo 4
Releases (v0.0.9c)
  • v0.0.9c(Feb 10, 2022)

    Installing moolib

    Install with pip: pip install moolib.

    See README.md for further instructions.

    New in moolib v0.0.9c

    • Add missing file. (#29, @heiner)
    • Add release and deploy GitHub logic. (#28, @heiner)
    • Add explanation for multiple batch sizes to examples/README.md. (#25, @heiner)
    • update allocators (#22, @tscmoo)
    • update async (#21, @tscmoo)
    • fix silly numpy import/fork bug (#24, @tscmoo)
    • Important change (#23, @heiner)
    • Fix #19 pickle concurrency bug (#20, @tscmoo)
    • Simplify README a bit. (#12, @heiner)
    • Update setup.py, add stuff necessary for pypi. (#18, @heiner)
    • Fix bug for none returns in dynamic batching (#13, @xiaomengy)
    • pickle: implement readinto to fix #14 (#15, @tscmoo)
    • Black 22.1.0 changed its mind about power operators, breaking our test. (#9, @heiner)
    • Fix deadlock #6 (#7, @tscmoo)
    • Update readme (#5, @heiner)
    • Add badges to README. (#3, @heiner)
Owner
Meta Research
OpenEmbedding is an open source framework for TensorFlow distributed training acceleration.

OpenEmbedding English version | Chinese version About OpenEmbedding is an open-source framework for TensorFlow distributed training acceleration. Nowadays, many m

4Paradigm 18 Jun 16, 2022
Nvvl - A library that uses hardware acceleration to load sequences of video frames to facilitate machine learning training

NVVL is part of DALI! DALI (Nvidia Data Loading Library) incorporates NVVL functionality and offers much more than that, so it is recommended to switc

NVIDIA Corporation 657 Jun 9, 2022
A c++ trainable semantic segmentation library based on libtorch (pytorch c++). Backbone: ResNet, ResNext. Architecture: FPN, U-Net, PAN, LinkNet, PSPNet, DeepLab-V3, DeepLab-V3+ by now.

Chinese C++ library with Neural Networks for Image Segmentation based on LibTorch. The main features of this library are: High level API (just a line to cr

null 259 Jun 25, 2022
C++ trainable detection library based on libtorch (or pytorch c++). Yolov4 tiny provided now.

C++ Library with Neural Networks for Object Detection Based on LibTorch. Libtorch Tutorials: Visit the Libtorch Tutorials Project if you want to know

null 44 Jun 23, 2022
LibtorchSegmentation - A c++ trainable semantic segmentation library based on libtorch (pytorch c++). Backbone: VGG, ResNet, ResNext. Architecture: FPN, U-Net, PAN, LinkNet, PSPNet, DeepLab-V3, DeepLab-V3+ by now.

English | Chinese C++ library with Neural Networks for Image Segmentation based on LibTorch. ⭐ Please give a star if this project helps you. ⭐ The main fea

null 259 Jun 25, 2022
Training and fine-tuning YOLOv4 Tiny on custom object detection dataset for Taiwanese traffic

Object Detection on Taiwanese Traffic using YOLOv4 Tiny Exploration of YOLOv4 Tiny on custom Taiwanese traffic dataset Trained and tested AlexeyAB's D

Andrew Chen 3 Mar 7, 2022
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

ONNX Runtime is a cross-platform inference and training machine-learning accelerator compatible with deep learning frameworks, PyTorch and TensorFlow/Keras, as well as classical machine learning libraries such as scikit-learn, and more.

Microsoft 7k Jun 25, 2022
ResNet Implementation, Training, and Inference Using LibTorch C++ API

LibTorch C++ ResNet CIFAR Example Introduction ResNet implementation, training, and inference using LibTorch C++ API. Because there is no native imple

Lei Mao 20 Jun 20, 2022
Training and Evaluating Facial Classification Keras Models using the Tensorflow C API Implemented into a C++ Codebase.

CFace Training and Evaluating Facial Classification Keras Models using the Tensorflow C API Implemented into a C++ Codebase. Dependancies Tensorflow 2

null 8 Nov 23, 2021
Dorylus: Affordable, Scalable, and Accurate GNN Training

Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads This is Dorylus, a Scalable, Resource-eff

UCLASystem 52 Jun 16, 2022
This is the code of our paper An Efficient Training Approach for Very Large Scale Face Recognition or F²C for simplicity.

Fast Face Classification (F²C) This is the code of our paper An Efficient Training Approach for Very Large Scale Face Recognition or F²C for simplicit

null 33 Jun 27, 2021
Implementation of Univaraint Linear Regresion (Supervised Machine Learning) in c++. With a data set (training set) you can predict outcomes.

Linear-Regression Implementation of Univaraint Linear Regresion (Supervised Machine Learning) in c++. With a data set (training set) you can predict o

vincent laizer 1 Nov 3, 2021
Reactive Light Training Module used in fitness for developing agility and reaction speed.

Hello to you , Thanks for taking interest in this project. Use case of this project is to help people that want to improve their agility and reactio

null 1 Oct 31, 2021
A system to flag anomalous source code expressions by learning typical expressions from training data

A friendly request: Thanks for visiting control-flag GitHub repository! If you find control-flag useful, we would appreciate a note from you (to niran

Intel Labs 1.2k Jun 27, 2022
Efficient training of deep recommenders on cloud.

HybridBackend Introduction HybridBackend is a training framework for deep recommenders which bridges the gap between evolving cloud infrastructure and

Alibaba 86 Jun 23, 2022
Weekly competitive programming training for newbies (Codeforces problem set)

Codeforces Basic Problem Set Weekly competitive programming training for newbies based on the Codeforces problem set. Note that, this training problem

Nguyen Hoang Hai 4 Apr 22, 2022
HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs)

Merlin: HugeCTR HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-T

null 668 Jun 24, 2022
This is a code repository for pytorch c++ (or libtorch) tutorial.

LibtorchTutorials English version. Environment: Win10, Visual Studio 2017 or Qt 4.11.0, Libtorch 1.7, OpenCV 4.5. Configuration: documents libtorch setup with Visual Studio and with Qt respectively, covering libtorch in VS and Q

null 323 Jun 27, 2022
GPU PyTorch TOP in TouchDesigner with CUDA-enabled OpenCV

PyTorchTOP This project demonstrates how to use OpenCV with CUDA modules and PyTorch/LibTorch in a TouchDesigner Custom Operator. Building this projec

David 65 Jun 15, 2022