A library for distributed ML training with PyTorch

Overview

moolib - a communications library for distributed ML training

moolib offers general-purpose RPC with automatic transport selection (shared memory, TCP/IP, InfiniBand), allowing models to data-parallelize their training and synchronize gradients and model weights across many nodes.

moolib is an RPC library to help you perform distributed machine learning research, particularly reinforcement learning. It is designed to be highly flexible and highly performant.

It is flexible because it allows researchers to define their own training loops and data-collection policies with minimal interference or abstractions - moolib gets out of the way of research code.

It is performant because it gives researchers the power of efficient data-parallelization across GPUs with minimal overhead, in a manner that is highly scalable.

moolib aims to provide researchers with the freedom to implement whatever experiment loop they desire, and the freedom to scale it up from a single GPU to hundreds at will (with no additional code). It ships with a reference implementation of IMPALA on Atari that can easily be adapted to other environments or algorithms.
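
For a flavor of the API, here is a minimal sketch of a blocking RPC round trip, based on the Rpc, define, connect, and sync calls that appear in the issue reports further below; it is an illustration, not the full training API:

import moolib

def double(x):
    return x * 2

# Define a callable on a host peer and listen for connections.
host = moolib.Rpc()
host.set_name("host")
host.define("double", double)
host.listen("127.0.0.1:4431")

# Connect a client peer and make a blocking call by peer name and function name.
client = moolib.Rpc()
client.connect("127.0.0.1:4431")

print(client.sync("host", "double", 21))  # prints 42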

Installing

To compile moolib without CUDA support, set:

export USE_CUDA=0

To install from GitHub:

pip install git+https://github.com/facebookresearch/moolib

To build from source:

git clone --recursive git@github.com:facebookresearch/moolib
cd moolib && pip install .

How to host docs (after installation):

pip install sphinx==4.1.2
cd docs && ./run_docs.sh

Run an Example

To run the example agent on a given Atari level:

First, start the broker:

python -m moolib.broker

It will output something like Broker listening at 0.0.0.0:4431.

Note that a single broker is enough for all your experiments.

Now take the IP address of your computer. If you ssh'd into your machine, this should work (in a new shell):

export BROKER_IP=$(echo $SSH_CONNECTION | cut -d' ' -f3)  # Should give your machine's IP.
export BROKER_PORT=4431

To start an experiment with a single peer:

python -m examples.vtrace.experiment connect=$BROKER_IP:$BROKER_PORT \
    savedir=/tmp/moolib-atari/savedir \
    project=moolib-atari \
    group=Zaxxon-Breakout \
    env.name=ALE/Breakout-v5

To add more peers to this experiment, start more processes with the same project and group settings, but with a different setting for device (default: 'cuda:0'), e.g. device=cuda:1.

Documentation

See moolib's API documentation.

Benchmarks

Results on Atari (figures atari_1 and atari_2).

License

moolib is licensed under the MIT License. See LICENSE for details.

Comments
  • RPC update; update TensorPipe, enable infiniband, various rpc-related updates & fixes

    This brings tensorpipe up to date with the latest version. InfiniBand is now enabled by default, and all of the code for handling CUDA tensors is present in the RPC, but CUDA is still disabled by default, as CUDA tensors are not yet supported in all-reduce, and a bit more testing should be done.

    It's a fair bit of code, but among the fixes/changes:

    • The Batcher broke in PyTorch 1.10, so that's fixed.
    • Properly clean up defined functions and their resources on shutdown.
    • Better (?) thread scheduling in some places; some deadlocks were also fixed or avoided this way.
    • Some performance improvements, although there are also some regressions. The best performance in the general case still seems to be achieved by calling moolib.set_max_threads(1), and this warrants further investigation (see the sketch below).
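
    A minimal usage sketch of that knob (assumption: it should be called before constructing any moolib.Rpc objects):

    import moolib

    # Cap moolib's internal thread pool at a single thread; per the note
    # above, this currently gives the best general-case performance.
    # (Assumption: call this before creating any moolib.Rpc objects.)
    moolib.set_max_threads(1)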
    CLA Signed 
    opened by tscmoo 4
  • More doc, tutorial, and whitepaper

    Thank you so much for open-sourcing this library. It looks very useful and interesting. Do you have plans to release more docs, tutorials, and/or a whitepaper explaining the design choices? How does moolib compare to alternative frameworks like Ray or PyTorch RPC, in terms of performance benchmarks and API design?

    Many thanks for your help!

    opened by LinxiFan 4
  • broker crashing, rv < 0: network is unreachable

    I'm running the IMPALA vtrace example, on one machine with a single peer. Using export BROKER_IP=$(echo $SSH_CONNECTION | cut -d' ' -f3), the moolib.broker crashes at some point in the run (and the peer job also crashes):

    $ python -m moolib.broker
    Broker listening at 0.0.0.0:4431
    terminate called after throwing an instance of 'std::runtime_error'
      what():  In connectFromLoop at .../moolib/src/tensorpipe/tensorpipe/transport/uv/uv.h:313 "rv < 0: network is unreachable"
    Aborted (core dumped)
    

    However, I have not run into this issue yet when starting the peer using BROKER_IP=0.0.0.0.

    opened by etaoxing 3
  • Tests broken

    Seems like there's some regression resulting in

                obs_future = envs.step(0, action_t)
    >           obs = obs_future.result()
    E           RuntimeError: Timed out waiting for env
    
    examples/a2c.py:216: RuntimeError
    =========================== short test summary info ============================
    FAILED test/integration/test_a2c.py::TestA2CExample::test_single_node_training
    !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
    

    See 1 and 2.

    opened by heiner 3
  • packet contained unknown content: -1

    I am trying to run parallel worker processes that send dictionaries of tensors to an inference process, where batched neural network forwarding happens. I get the error E0209 20:15:45.299231 2754780 /tmp/pip-req-build-fpmouidj/src/tensorpipe/tensorpipe/core/listener_impl.cc:333] packet contained unknown content: -1. The program runs fine and terminates fine, and I believe the results are correct too. Here is minimal code to reproduce the error.

    import asyncio
    import multiprocessing as mp
    import moolib
    import torch
    
    
    local_addr = "127.0.0.1:4412"
    
    async def process(queue, callback):
        i = 0
        try:
            while True:
                # print("waiting for batch", i)
                i += 1
                return_callback, args, kwargs = await queue
                if args and kwargs:
                    retval = callback(*args, **kwargs)
                elif args:
                    retval = callback(*args)
                elif kwargs:
                    retval = callback(**kwargs)
                else:
                    retval = callback()
                return_callback(retval)
        except asyncio.CancelledError:
            print("process cancelled")
            pass
        except Exception as e:
            print(e)
            raise
    
    
    linear_layer = torch.nn.Linear(1000, 1000)
    
    
    def run_linear_host(x):
        a = linear_layer(x["a"])
        b = linear_layer(x["b"])
        return a + b
    
    
    async def host_func(barrier):
        host = moolib.Rpc()
        host.set_name("host")
        host.listen(local_addr)
        queue = host.define_queue("linear", batch_size=1000, dynamic_batching=True)
    
        barrier.wait()
        print("host process passed the barrier")
        await process(queue, run_linear_host)
    
    
    async def client_func(barrier, pid):
        client = moolib.Rpc()
        barrier.wait()
        client.connect(local_addr)
        print(f"process {pid} connected")
    
        ys = []
        for i in range(100):
            x = {
                "a": torch.rand(1000),
                "b": torch.rand(1000),
            }
            y = await client.async_("host", "linear", x)
            ys.append(y)
        print("done")
    
    
    
    if __name__ == "__main__":
        num_thread = 10
        barrier = mp.Barrier(num_thread + 1)
    
        host_p = mp.Process(target=lambda: asyncio.run(host_func(barrier)))
        host_p.start()
    
        processes = []
        for i in range(num_thread):
            p = mp.Process(target=lambda: asyncio.run(client_func(barrier, i)))
            p.start()
            processes.append(p)
    
        for p in processes:
            p.join()
    
        host_p.terminate()
    

    The error disappears with smaller values of num_thread (1, 2, or 3) and appears more frequently with larger values.

    I installed the latest main branch with pip install git+https://github.com/facebookresearch/moolib.

    opened by hengyuan-hu 3
  • Question about numGradients in setGradients()

    Why are the gradients in setGradients() divided by data.numGradients? Shouldn't they be divided by data.batchSize?

            grad.mul_(1.0f / data.numGradients);
    

    data.numGradients seems to be the number of times reduce_gradients() was called, while data.batchSize is the sum of the batch_size values passed to reduce_gradients(). numGradients doesn't seem to be a good indicator of how many iterations' worth of gradients are accumulated.

    For example,

    • set_virtual_batch_size(8)
    • reduce_gradients(batch_size=4) is called -> numGradients += 1
    • reduce_gradients(batch_size=3) is called -> numGradients += 1
    • reduce_gradients(batch_size=2) is called -> numGradients += 1
    • (after gradients all-reduce) data.batchSize=9, data.numGradients=3

    Shouldn't gradients be divided by 9, not 3?
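
    To make the arithmetic concrete, here is a small sketch of the two normalizations under the semantics assumed above (per-sample gradients of 1, so the correct per-sample average is 1):

    import torch

    # Three reduce_gradients() calls with batch sizes 4, 3 and 2; each call
    # contributes the sum of its per-sample gradients.
    per_call_sums = [torch.ones(2) * b for b in (4, 3, 2)]
    accumulated = sum(per_call_sums)   # what the all-reduce accumulates

    num_gradients = 3  # data.numGradients: number of reduce_gradients() calls
    batch_size = 9     # data.batchSize: sum of the batch_size arguments

    print(accumulated / num_gradients)  # current scaling -> tensor([3., 3.])
    print(accumulated / batch_size)     # per-sample mean  -> tensor([1., 1.])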

    opened by bgyoon 2
  • Performance regression after PR#26

    It seems there is a performance regression after https://github.com/facebookresearch/moolib/pull/26. The regression shows up mainly when the machine is busy.

    A simple reproduction script is shown below. Running it on a learnfair machine, the average tot_time increased from about 4.40 ms to 18.50 ms.

    from typing import Any, Callable, NoReturn, Tuple
    
    import asyncio
    import time
    
    import numpy as np
    
    import torch
    import torch.nn as nn
    import torch.multiprocessing as mp
    
    import moolib
    
    N = 8192
    addr = "127.0.0.1:1234"
    
    # device = "cpu"
    device = "cuda"
    
    dynamic_batching = True
    num_clients = 32
    
    
    class Server:
        def __init__(self, dynamic_batching: bool = False) -> None:
            self._addr = addr
            self._dynamic_batching = dynamic_batching
    
            self._server = None
            self._linear = None
            self._device = device
    
            self._process = None
            self._loop = None
    
        @property
        def addr(self) -> str:
            return self._addr
    
        @property
        def dynamic_batching(self) -> bool:
            return self._dynamic_batching
    
        def start(self) -> None:
            self._process = mp.Process(target=self.run)
            self._process.start()
    
        def terminate(self) -> None:
            self._process.terminate()
    
        def run(self) -> NoReturn:
            self._linear = nn.Linear(N, N).to(self._device)
            self._server = moolib.Rpc()
            self._server.set_name("server")
            self._server.set_timeout(60)
            self._server.listen(addr)
    
            if self._dynamic_batching:
                self._loop = asyncio.get_event_loop()
                que = self._server.define_queue("run_model",
                                                batch_size=128,
                                                dynamic_batching=True)
                self._loop.create_task(self._async_process(que, self._run_model))
                self._loop.run_forever()
            else:
                self._server.define("run_model", self._run_model)
                while True:
                    time.sleep(1)
    
        def _run_model(self, x: torch.Tensor) -> Tuple[torch.Tensor, float]:
            batch_size = x.size(0) if x.dim() > 1 else None
            t0 = time.perf_counter()
            y = self._linear(x.to(self._device)).sum(dim=-1).cpu()
            t1 = time.perf_counter()
            t = t1 - t0
            if batch_size is not None:
                t = (t, ) * batch_size
            return y, t
    
        async def _async_process(self, que: moolib.Queue,
                                 func: Callable[..., Any]) -> NoReturn:
            try:
                while True:
                    ret_cb, args, kwargs = await que
                    ret = func(*args, **kwargs)
                    ret_cb(ret)
            except asyncio.CancelledError:
                pass
    
    
    class Client:
        def __init__(self, index: int) -> None:
            self._index = index
            self._addr = addr
            self._client = None
            self._process = None
    
        @property
        def index(self) -> int:
            return self._index
    
        def start(self) -> None:
            self._process = mp.Process(target=self.run)
            self._process.start()
    
        def join(self) -> None:
            self._process.join()
    
        def terminate(self) -> None:
            self._process.terminate()
    
        def run(self) -> NoReturn:
            self._client = moolib.Rpc()
            self._client.set_name(f"client-{self._index}")
            self._client.set_timeout(60)
            self._client.connect(self._addr)
    
            x = torch.randn(N)
            num = 10000
    
            stats1 = []
            stats2 = []
            for _ in range(num):
                t0 = time.perf_counter()
                _, t = self._client.sync("server", "run_model", x)
                t1 = time.perf_counter()
                stats1.append(t1 - t0)
                stats2.append(t)
    
            mean1 = np.mean(stats1) * 1000.0
            mean2 = np.mean(stats2) * 1000.0
            print(
                f"[Client-{self._index}], tot_time = {mean1}, run_time = {mean2}")
    
    
    def main() -> None:
        server = Server(dynamic_batching=dynamic_batching)
        server.start()
    
        clients = []
        for i in range(num_clients):
            client = Client(i)
            clients.append(client)
            client.start()
        for client in clients:
            client.join()
    
        server.terminate()
    
    
    if __name__ == "__main__":
        mp.set_start_method("spawn")
        main()
    
    opened by xiaomengy 2
  • Add Dockerfile.

    This successfully compiles moolib and runs the a2c.py toy example.

    Perhaps we could add some kind of automation to this? Upload to dockerhub? @condnsdmatters, wdyt?

    CLA Signed 
    opened by heiner 2
  • float in pythonserialization.h

    Shouldn't a Python float be cast to a double instead of a float?

    If you run

    import moolib
    
    def foo(str):
        print(str)
        return 42
    
    host = moolib.Rpc()
    host.set_name("host")
    host.define("bar", foo)
    host.listen("127.0.0.1:1234")
    
    client = moolib.Rpc()
    client.connect("127.0.0.1:1234")
    
    print(client.sync("host", "bar", .00002))
    

    Output

    1.9999999494757503e-05
    42
    

    Expected output (this is the actual output when float is replaced with double):

    2e-05
    42
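
    The precision loss matches a float32 round trip of a Python double. A minimal standard-library sketch (not moolib code) that reproduces the value:

    import struct

    # Python floats are IEEE-754 doubles; packing one into a 32-bit float
    # and unpacking it again reproduces the output above.
    x = 0.00002
    f32 = struct.unpack("f", struct.pack("f", x))[0]
    print(f32)  # 1.9999999494757503e-05
    print(x)    # 2e-05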
    
    opened by bgyoon 1
  • fix silly numpy import/fork bug

    This is pretty silly, but during serialization we check whether the input is a numpy array (despite #17). For this check, pybind needs to import numpy. During import, my local version of numpy calls fork. From googling, this is seen as a no-no, but it does it regardless; I've not investigated why. Forking during serialization breaks moolib with the usual don't-fork fatal error message (and setting the MOOLIB_ALLOW_FORK env var results in a deadlock). There is no issue if the user already imported numpy (or pytorch, as that also pulls in numpy). However, if they didn't, even the most basic usage of moolib results in a fatal error.

    This fix calls the pybind code that imports numpy during moolib module initialization, so numpy is already imported by the time serialization runs.

    CLA Signed 
    opened by tscmoo 1
  • Serialization issue for python class with multiple member variables when running in parallel.

    It seems the fix for https://github.com/facebookresearch/moolib/issues/14 introduced some new issues. When transmitting Python classes with multiple member variables while multiple clients run in parallel, we see errors like the one below. The issue seems to come from the newly added readinto function. I created a reproduction script, listed below. We have to keep working on this.

    RuntimeError: Remote exception during RPC call (print_data): BufferError: buffer size is 42, but readinto requested 161372 bytes
    
    import asyncio
    import time
    
    from typing import Any, Callable, NoReturn, Optional
    
    import torch
    import torch.multiprocessing as mp
    
    import moolib
    
    addr = "127.0.0.1:4411"
    timeout = 60
    
    
    class DataWrapper:
        def __init__(self,
                     x: Optional[Any] = None,
                     y: Optional[Any] = None) -> None:
            self.x = x
            self.y = y
    
        def __str__(self) -> str:
            return f"[{self.x}]"
    
        def __repr__(self) -> str:
            return self.__str__()
    
    
    def print_data(x: Any) -> None:
        if isinstance(x, tuple):
            print(f"[print_data] batch_size = {len(x)}")
        else:
            print(f"[print_data] data_size = {x.x.size()}")
        time.sleep(0.01)
    
    
    def handle_task_exception(task: asyncio.Task) -> None:
        try:
            task.result()
        except asyncio.CancelledError:
            pass
        except Exception as e:
            raise e
    
    
    async def process(que: moolib.Queue, callback: Callable[..., Any]) -> NoReturn:
        try:
            while True:
                ret_cb, args, kwargs = await que
                if args and kwargs:
                    ret = callback(*args, **kwargs)
                elif args:
                    ret = callback(*args)
                elif kwargs:
                    ret = callback(**kwargs)
                else:
                    ret = callback()
                ret_cb(ret)
        except asyncio.CancelledError:
            print("[Server] process cancelled")
            pass
        except Exception as e:
            print(e)
            raise
    
    
    def server_func() -> NoReturn:
        server = moolib.Rpc()
        server.set_name("server")
        server.set_timeout(timeout)
        server.listen(addr)
    
        loop = asyncio.get_event_loop()
        task = loop.create_task(
            process(server.define_queue("print_data"), print_data))
        # task = loop.create_task(
        #     process(
        #         server.define_queue("print_data",
        #                             batch_size=100,
        #                             dynamic_batching=True), print_data))
        task.add_done_callback(handle_task_exception)
        loop.run_until_complete(task)
    
    
    def client_func(index: int) -> None:
        client = moolib.Rpc()
        client.set_name(f"client-{index}")
        client.set_timeout(timeout)
        client.connect(addr)
    
        x = torch.randn(496 + index, 10)
        y = torch.randn(422 + index, 95)
    
        num = 10
        x_wrapped = DataWrapper(x, y)
        for _ in range(num):
            client.sync("server", "print_data", x_wrapped)
            time.sleep(0.01)
    
    
    def main() -> None:
        server_proc = mp.Process(target=server_func)
        server_proc.start()
    
        num_clients = 4
        client_processes = []
        for i in range(num_clients):
            client_proc = mp.Process(target=client_func, args=(i, ))
            client_proc.start()
            client_processes.append(client_proc)
    
        for client_proc in client_processes:
            client_proc.join()
    
        server_proc.terminate()
    
    
    if __name__ == "__main__":
        mp.set_start_method("spawn")
        main()
    
    opened by xiaomengy 1
  • Note on EnvPool Features

    This is just a note to keep track of some useful EnvPool features:

    • "Retries" - It would be good add an option to allow a certain number of retries if the Env crashes. It can be a little annoying if there is a single env that crashes and brings down training. Ideally, this could return a done, and restart, only truly crashing after say N failures.

    • "Reset" - The envpool doesnt currently have a reset function! It probably should.

    enhancement 
    opened by condnsdmatters 0
  • Atari median human-normalized score

    Hi devs,

    Thanks a lot for the great library. It's been observed that moolib improves quite significantly over torchbeast. Great!

    May I know if you have, or could generate, the aggregated median human-normalized score curve over all the tested games? I'm wondering how it compares to the original IMPALA paper. I'm also wondering what drives the improvements; do you have intuitions about it? moolib improves the distributed communication, but I don't think that is directly related to dramatic reward improvements. It would be great if you could help me understand this better. Thanks in advance!

    opened by zhongwen 3
  • EnvPool should detect if its worker processes have ended

    Right now, any error in the environment, at both startup and stepping time, results in a timeout in EnvPool. We should probably waitpid(2) or something similar in the child and communicate issues to the (grand)parent.
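
    A minimal sketch of that idea with the plain os module (not EnvPool code): fork a child that exits immediately, then poll it with a non-blocking waitpid instead of waiting for a timeout:

    import os
    import sys
    import time

    child_pid = os.fork()
    if child_pid == 0:
        sys.exit(1)  # simulate an env worker crashing at startup

    time.sleep(0.1)
    # With WNOHANG, waitpid returns (0, 0) while the child is still running.
    pid, status = os.waitpid(child_pid, os.WNOHANG)
    if pid != 0:
        code = os.waitstatus_to_exitcode(status)  # Python 3.9+
        print(f"env worker {pid} exited with code {code}")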

    opened by heiner 0
  • Running sbatch_experiment.py with `--broker :` breaks Python

    When running

    python examples/sbatch_experiment.py --broker $BROKER_IP:$BROKER_PORT -n 32
    

    without setting $BROKER_IP or $BROKER_PORT, moolib causes a hard termination:

    terminate called after throwing an instance of 'std::invalid_argument'
      what():  In createInetSockAddr at ../../src/tensorpipe/tensorpipe/transport/uv/sockaddr.cc:87 ":"
    Aborted (core dumped)
    
    opened by heiner 0
Releases (v0.0.9c)
  • v0.0.9c (Feb 10, 2022)

    Installing moolib

    Install with pip: pip install moolib.

    See README.md for further instructions.

    New in moolib v0.0.9c

    • Add missing file. (#29, @heiner)
    • Add release and deploy GitHub logic. (#28, @heiner)
    • Add explanation for multiple batch sizes to examples/README.md. (#25, @heiner)
    • update allocators (#22, @tscmoo)
    • update async (#21, @tscmoo)
    • fix silly numpy import/fork bug (#24, @tscmoo)
    • Important change (#23, @heiner)
    • Fix #19 pickle concurrency bug (#20, @tscmoo)
    • Simplify README a bit. (#12, @heiner)
    • Update setup.py, add stuff necessary for pypi. (#18, @heiner)
    • Fix bug for none returns in dynamic batching (#13, @xiaomengy)
    • pickle: implement readinto to fix #14 (#15, @tscmoo)
    • Black 22.1.0 changed its mind about power operators, breaking our test. (#9, @heiner)
    • Fix deadlock #6 (#7, @tscmoo)
    • Update readme (#5, @heiner)
    • Add badges to README. (#3, @heiner)
Owner
Meta Research