Fairring (FAIR + Herring) is a plug-in for PyTorch providing a process group for distributed training that outperforms NCCL at large scales.

Fairring (FAIR + Herring): a faster all-reduce

TL;DR:

  • Using a variation on Amazon’s "Herring" technique, which leverages reduction servers, we can perform the all-reduce collective faster than NCCL:
    • up to 2x as fast as NCCL in microbenchmarks
    • up to 50% speedup in end-to-end training workloads
  • You can use it right away in your project, with no code changes, by installing our PyTorch extension which is a drop-in replacement for NCCL.

Overview

Herring is a technique that was "rediscovered" by Amazon about a year ago for implementing the all-reduce collective using reduction servers.

Libraries such as NCCL or Gloo implement all-reduce by establishing a ring among all devices and then passing messages on it until all gradients have been propagated everywhere. This requires a number of successive message-passing rounds that grows linearly with the total number of GPUs, which leads to poor scaling at large sizes.

Herring, on the other hand, uses a client-server approach: every machine sends its own gradient contribution to a global server, which, after accumulating all these contributions, sends the final reduced gradient back to all clients. Thus a constant number of network hops is performed regardless of the number of nodes, which ideally should lead to perfect scaling. It’s also straightforward to spread the bandwidth load across an arbitrary number of nodes by sharding, thus avoiding any bottleneck.
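
To make the contrast concrete, here is a back-of-the-envelope comparison of the number of sequential communication rounds per all-reduce under the two approaches. The counts are illustrative only and depend on the concrete implementation:

# Rough, illustrative round counts per all-reduce; not measured numbers.
def ring_rounds(num_gpus: int) -> int:
    # Classic two-phase ring all-reduce (reduce-scatter + all-gather):
    # the number of successive message-passing rounds grows linearly.
    return 2 * (num_gpus - 1)

def herring_rounds(num_gpus: int) -> int:
    # Client/server (Herring-style) all-reduce: push shards to the reduction
    # servers, pull the reduced result back -- two hops regardless of scale.
    return 2

for n in (8, 64, 256):
    print(f"{n:>3} GPUs: ring={ring_rounds(n):>3} rounds, herring={herring_rounds(n)} rounds")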

Amazon, Google and ByteDance have all adopted this client-server technique, and Fairring is FAIR's take on it!

Installation

# Install CUDA and the static build of NCCL

conda create -n fairring python=3.9 pip pytorch=1.11.0.dev20211112 cudatoolkit=11.1 -c pytorch-nightly -c nvidia -y
conda activate fairring

cd
git clone --recurse-submodules https://github.com/pytorch/tensorpipe.git
cd tensorpipe
mkdir build
cd build
cmake ../ -GNinja -DTP_USE_CUDA=1 -DTP_BUILD_LIBUV=1 -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=0 -DCMAKE_PREFIX_PATH=$CONDA_PREFIX -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX '-DCMAKE_CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 -fPIC'
ninja
ninja install

cd
git clone --recurse-submodules https://github.com/facebookresearch/fairring.git
cd fairring
CXX=g++-10 python setup.py install

Usage

Replace:

torch.distributed.init_process_group("nccl")

with:

import fairring

torch.distributed.init_process_group("fairring")

Caveats

  • This initial version of Fairring only supports RDMA (i.e., InfiniBand or RoCE with GPUDirect). We will add support for TCP and other fabrics shortly.
  • Fairring implements a custom all-reduce method, but for all other methods it delegates to vanilla NCCL. Hence you will not see any difference for FSDP (Fully Sharded Data Parallel), since FSDP only uses reduce-scatter and all-gather operations.
  • Fairring requires the job’s deployment to be very regular: every machine involved in training must have the same number of GPUs, and they must all be used. This is usually the case already.
  • Fairring by default will allocate 50MiB of staging buffers on all GPUs. This can be changed by setting the FAIRRING_TOTAL_MEMORY environment variable to another value (in bytes). Note that lowering this value might reduce Fairring’s parallelism.
  • By default Fairring will use a slice size that matches the DDP default bucket size of 25MiB. To change the slice size (e.g., to keep it in sync with the bucket size, if that gets manually changed) one can set the FAIRRING_SLICE_SIZE environment variable (to an amount of bytes).
  • If Fairring fails to detect the network interface to listen on, it can be specified manually with the TP_SOCKET_IFNAME environment variable (see the sketch after this list).
  • You’ll need a very recent version of PyTorch: either a prebuilt nightly binary from at least November 12, or a build from source of the master branch after this commit.
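
As an example of the knobs above, the environment variables can be set before the process group is initialized. The concrete values and the interface name below are only illustrative assumptions, not recommendations:

import os

# Illustrative values only; see the caveats above for what each knob does.
os.environ.setdefault("FAIRRING_TOTAL_MEMORY", str(100 * 1024 * 1024))  # 100MiB of staging buffers
os.environ.setdefault("FAIRRING_SLICE_SIZE", str(25 * 1024 * 1024))     # keep in sync with DDP's bucket size
os.environ.setdefault("TP_SOCKET_IFNAME", "eth0")                       # hypothetical interface name

import torch.distributed as dist
import fairring

dist.init_process_group("fairring")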

End-to-end benchmarks

We benchmarked an end-to-end training run of a 1.37B-parameter Transformer model on wikitext-2 data. The benchmarks were run on V100 GPUs with an InfiniBand fabric. The figure below compares the throughput of Fairring and NCCL. The y axis shows throughput in words per second and the x axis shows the number of GPUs, scaling from 8 to 256 (i.e., from 1 to 32 nodes). The data labels show the percentage increase in throughput of Fairring over NCCL.

With Fairring we see an increase in throughput of 52.86% at 32 GPUs and 42.82% at 64 GPUs. The scaling efficiency at 32 and 64 GPUs is 90% and 72% respectively. In contrast, NCCL reaches its highest scaling efficiency, 68%, at 16 GPUs. At smaller scale (8 to 16 GPUs) Fairring underperforms NCCL, since the overhead of the two network hops plus sharding outweighs the cost of NCCL’s all-reduce. We also haven’t optimized the single-machine codepath, where we could skip sending gradients to the server.

Amazon’s experiments with the RoBERTa model (1.3B) show a throughput increase of 29% at 64 GPUs with a scaling efficiency of 70%. The Herring paper reports results indicating an 85% scaling efficiency irrespective of cluster size (64 to 512 GPUs). Google has only published a throughput increase from 0.4 to 0.7 words per second when training a BERT model on 8 NVIDIA A100s.

However, past the 128-GPU mark we observe that Fairring fails to scale as well as NCCL. This is unexpected, since Fairring’s throughput is supposed to scale linearly with the number of GPUs. We suspect this behavior is caused by Fairring "over-sharding" when running on too many machines, and possibly being bottlenecked on the CPU. We observe the same behavior in our microbenchmarks (see below), where we were able to mitigate it by scaling the bucket size with the number of GPUs; we thus believe a similar solution should help end-to-end performance too.

[Figure: end_to_end_benchmark]

Our take on Herring

[Figure: diagram of Fairring's all-reduce]

An all-reduce in Fairring is broken down into two layers, intra-machine and inter-machine, and proceeds as follows (a schematic sketch using standard PyTorch collectives follows the list):

  • First a local reduction step takes place among the GPUs of each machine. This is done to take advantage of the high bandwidth of the intra-machine NVLinks to "deduplicate" the gradients of the local GPUs so that we have less data to send over the inter-machine InfiniBand, which is slower. We perform this step using the reduce-scatter primitive of NCCL (yes, Fairring internally uses NCCL), which means the result of this local reduction is spread across all GPUs. That’s because each machine has multiple InfiniBand cards, connected to different GPUs, hence we can only reach peak bandwidth if all GPUs participate in the process.
  • Then each gradient is sharded and shared across all machines: each machine sends the first chunk of its gradient to the first machine, the second chunk to the second machine, and so on. In a sense, this is an all-to-all operation. We do this step using TensorPipe, which was designed for exactly this kind of point-to-point use case. To avoid over-sharding, we use a "rails" topology: GPU N on one machine only communicates with GPU N of other machines. This also reduces the total number of pairwise connections we need to establish.
  • Once a GPU has received all gradient chunks from all other GPUs, it adds them together.
  • Then it sends the fully reduced gradient back to all the GPUs from which it had received a chunk.
  • And in the final step we "reverse" the initial reduce-scatter: now each GPU holds a portion of the final gradient, and we need to send this to the other GPUs on that machine, hence we do a local all-gather step among them, again using NCCL.
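
The sketch below approximates the data flow of these steps using standard torch.distributed collectives, purely for illustration. It is not the Fairring implementation: Fairring drives the inter-machine steps with TensorPipe and its own staging buffers, whereas here the all-to-all and the return trip are expressed as collectives on a hypothetical "rail" process group (one group per local rank, spanning all machines), and local_pg is assumed to be an intra-machine NCCL group. A recent PyTorch (2.x) is assumed for reduce_scatter_tensor and all_gather_into_tensor.

import torch
import torch.distributed as dist

def fairring_style_allreduce(grad: torch.Tensor,
                             local_pg: dist.ProcessGroup,
                             rail_pg: dist.ProcessGroup) -> torch.Tensor:
    """Schematic decomposition of Fairring's all-reduce (illustrative only).

    Assumes `grad` is a flat 1-D tensor whose size divides evenly by the
    number of local GPUs and by the number of machines.
    """
    local_world = dist.get_world_size(group=local_pg)   # GPUs per machine
    num_machines = dist.get_world_size(group=rail_pg)

    # 1) Intra-machine reduce-scatter over NVLink: each GPU ends up owning one
    #    "slot" (1/local_world-th) of the locally reduced gradient.
    slot = torch.empty(grad.numel() // local_world, dtype=grad.dtype, device=grad.device)
    dist.reduce_scatter_tensor(slot, grad, group=local_pg)

    # 2) Shard the slot across machines along the "rail" (same local rank on
    #    every machine): shard i goes to machine i. Fairring uses TensorPipe
    #    point-to-point transfers here; an all-to-all expresses the same flow.
    shards_out = list(slot.chunk(num_machines))
    shards_in = [torch.empty_like(s) for s in shards_out]
    dist.all_to_all(shards_in, shards_out, group=rail_pg)

    # 3) Each GPU sums the shards it received from all machines: it now holds
    #    the fully reduced values for its shard.
    reduced_shard = torch.stack(shards_in).sum(dim=0)

    # 4) Send the fully reduced shard back to every machine it came from
    #    (folded into an all-gather on the rail for brevity).
    gathered = [torch.empty_like(reduced_shard) for _ in range(num_machines)]
    dist.all_gather(gathered, reduced_shard, group=rail_pg)
    slot = torch.cat(gathered)

    # 5) Intra-machine all-gather to reassemble the full gradient on every GPU.
    out = torch.empty_like(grad)
    dist.all_gather_into_tensor(out, slot, group=local_pg)
    return out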

Some of the steps require a staging area to store some intermediate result. For that, we allocate upfront a certain number of fixed buffers, which we reuse. If the tensor given to Fairring is larger than one of these buffers, the tensor is cut into slices, each of which is processed separately, as if it were its own tensor. The careful reader will have noticed that Fairring does a "three-dimensional" sharding: first a tensor is flattened and cut into slices (as described just above) if it’s too big; then each slice goes through the reduce-scatter within the machine and each GPU gets one "slot" of that slice; finally that slot is split into shards, each of which is sent to a different machine.
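
As a back-of-the-envelope illustration of this three-level sharding, here is a small helper that computes the resulting sizes. The helper name and the concrete cluster shape (8 GPUs per machine, 16 machines) are assumptions; the default slice size follows the 25MiB mentioned above.

SLICE_SIZE = 25 * 1024 * 1024  # bytes; matches DDP's default bucket size

def three_level_sizes(tensor_bytes: int,
                      gpus_per_machine: int = 8,
                      num_machines: int = 16,
                      slice_size: int = SLICE_SIZE):
    """Illustrative sizes for the slice / slot / shard decomposition."""
    num_slices = -(-tensor_bytes // slice_size)      # ceil division
    slice_bytes = min(tensor_bytes, slice_size)
    slot_bytes = slice_bytes // gpus_per_machine     # after the local reduce-scatter
    shard_bytes = slot_bytes // num_machines         # sent to each machine
    return num_slices, slot_bytes, shard_bytes

# A 100MiB bucket is cut into 4 slices of 25MiB; each GPU owns a ~3.1MiB slot
# of each slice, and sends ~200KiB shards of that slot to each machine.
print(three_level_sizes(100 * 1024 * 1024))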

Notice that, as the number of machines used for training increases, the number of steps performed by Fairring remains constant. Moreover, the data sent and received by each GPU also remains constant because the increase in the number of machines is perfectly counterbalanced by the decrease in the shard size. This means that no hotspot, congestion or degradation should appear as the job size increases.

Differences with Amazon’s design

If you’ve read the Amazon paper you might have noticed that our version of Herring ended up being different from what they described. Our design was guided by two principles: Fairring should be a viable drop-in replacement for NCCL in PyTorch’s distributed data parallel (DDP) training, and it should be optimized for the InfiniBand GPUDirect fabric.

Amazon suggested a clear separation between the clients and the reduction servers, with the latter hosted on commodity CPU-only machines that had to be launched and managed separately. Instead, we opted to have each GPU "double" as both a client and a server (thus avoiding extra processes and machines). Moreover, we chose to store gradients on the GPU, to take advantage of the fast interconnect, at the cost of slightly higher GPU memory consumption (by default Fairring allocates 50MiB). Another consequence of offloading communication to the GPUs and InfiniBand is that we don’t need proxy servers or other CPU multithreading techniques to reach high bandwidths.

Amazon also heavily stresses the role that its Elastic Fabric Adapter (EFA) plays in making Herring viable, particularly its Scalable Reliable Datagram mode and its Equal-Cost Multi-Path routing. Our understanding is that EFA is a technology that tries to mimic InfiniBand while using an Ethernet link layer (cables and switches), basically Amazon’s flavor of RDMA over Converged Ethernet (RoCE), and thus we believe that an InfiniBand network should perform about as well. We’re currently using connection-oriented transfers, rather than datagram-based ones, and we don’t really know how routing is set up.

Another technique given a lot of importance in Amazon’s paper is balanced fusion buffers. We believe that the bucketing performed by default by PyTorch’s DDP library, combined with the sharding done internally by Fairring, should be essentially equivalent to that feature.

Better overlapping than NCCL

During the backwards pass, once "enough" gradients have been computed locally, PyTorch’s DDP puts them in a so-called bucket, which is immediately passed to all-reduce even while the rest of the backwards pass is still running. This is done so that the communication of that bucket can overlap with the computation of further gradients, which saves time and leads to higher average GPU utilization.

However, when using NCCL, all these all-reduce operations are enqueued on the same stream and, due to NCCL’s synchronous/blocking behavior within the stream, one such operation cannot even start until the previous one has completely finished. Hence there is no overlap between these all-reduce operations. This can cause a backlog of such operations to build up on the stream, and waiting for it to flush introduces a delay between the end of the backward pass and the optimizer step, thus slowing down the job. (In some cases the all-reduce operations complete faster than they are enqueued and thus never build up, but this only happens in compute-bound jobs, which are not our focus here since they won’t benefit from any improvement to communication.)

In Fairring we’ve tried hard to allow successive operations to overlap with each other. For some substeps we use NCCL, hence those substeps are still mutually exclusive, but they can run in parallel with other substeps. For the main inter-machine steps we use TensorPipe, which is extremely asynchronous: Fairring just schedules a transfer with TensorPipe and gets notified when it terminates; in the meantime Fairring can go back to doing other things, such as scheduling further transfers, which TensorPipe will carry out in parallel. Hence the only queuing that will happen is "on the wire", if there’s not enough bandwidth, and we don’t think we can do much about that. (The number of staging buffers that Fairring is allowed to allocate also imposes a limit on concurrency, but that can be tuned.)
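
To illustrate the same-stream serialization issue with a toy, single-GPU example (purely an analogy, not Fairring or NCCL code): work enqueued on one CUDA stream runs strictly in order, while work spread across streams is allowed to overlap, which is the kind of concurrency Fairring recovers by driving its inter-machine transfers asynchronously.

import torch

assert torch.cuda.is_available()
a = torch.randn(4096, 4096, device="cuda")
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

# Same stream: the second matmul cannot start before the first finishes,
# just like back-to-back NCCL all-reduces on the collective stream.
with torch.cuda.stream(s1):
    b = a @ a
    c = a @ a

# Separate streams: the GPU scheduler is free to overlap the two matmuls.
with torch.cuda.stream(s1):
    b = a @ a
with torch.cuda.stream(s2):
    c = a @ a

torch.cuda.synchronize()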

Microbenchmarks

To gain more insight into some specific aspects of Fairring, we ran a set of microbenchmarks where we explored the effect of changing a single parameter of our algorithm.

Single transfers

First we compared the "one-shot" performance, performing a single all-reduce (no queuing, no overlapping) on tensors of various sizes. In other words, there’s a single bucket which holds the entire input tensor. We ran this on 8 machines.

[Figure: microbenchmark_1]

We see that for small tensors (up to ~6MiB) NCCL is faster, but for larger tensors (25MiB or more) Fairring starts outperforming NCCL. At 100MiB, Fairring is 43% faster than NCCL. In real workloads we expect Fairring to operate at around 25MiB per bucket, as that’s the default value used by DDP, so these results suggest good performance there. We suspect the performance degradation for small tensors is due to Fairring "over-sharding": with each shard being so small, any overhead becomes more prominent. We should still look into it and see whether we can improve it.

Overlap/Parallelism

Then we looked at the role of parallelism/overlap in Fairring, to determine how much it helps. We took 64 buckets of 25MiB each (hence 1.6GB total), but limited the staging memory that Fairring was allowed to allocate, so that only a certain number of buckets could be in flight in parallel. This was again run on 64 GPUs.

[Figure: microbenchmark_2]

The "shape" of the results is exactly what we expect. We see that increasing parallelism helps (it decreases the total time) but only up to a point, since presumably eventually some other bottleneck factors will kick in which make further parallelism useless. Here that sweet spot appears to be at 4 or slightly above, at which point Fairring is around 50% faster than NCCL. NCCL doesn’t allow any parallelism, hence we ran it only once and plotted the result as a constant line.

Scaling

Lastly, and most interestingly, we looked at the effect of scale (i.e., the number of machines) on performance. We again used 64 buckets of 25MiB each, with maximum parallelism, and increased the number of machines from 1 (with 8 GPUs) to 32 (with 256 GPUs). We plot the average of multiple jobs in order to counter the variability we experienced from running on different sets of machines.

[Figure: microbenchmark_3]

  • The left-hand side of the plot is exactly what we expect: on small jobs Fairring doesn’t perform too well (although it’s already faster than NCCL, except on a single machine, a codepath we could optimize but haven’t), and it gets better as the job scales. At its peak, Fairring was 2x as fast as NCCL.
  • The right-hand side is more unexpected: NCCL’s performance appears to plateau, which runs contrary to our understanding. We suspect that this could be due to NCCL switching its algorithm from rings to some other topology, such as trees, when it detects that the job is too large. But, more importantly, we see that Fairring’s performance sharply declines and becomes worse than NCCL’s at 256 GPUs. Our theory is that at such large scales we again hit the problem of our shards becoming too small, and thus waste too much time on overhead. To test this hypothesis we also launched a run where the bucket size increases proportionally to the number of machines (i.e., 25MiB for 8 machines, 50MiB for 16 machines, 100MiB for 32 machines), so that the shards remain the same size. Plotting this, we saw that it leads to much better scaling and allows Fairring to outperform NCCL by a consistent margin even at large scales (a sketch of this bucket-size scaling follows the list).
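
For reference, here is a sketch of the bucket-size scaling used in that extra run, keeping DDP’s bucket size and Fairring’s slice size in lockstep with the number of machines. The helper and the concrete values are illustrative assumptions, not part of Fairring.

import os

BASE_BUCKET_MIB = 25   # DDP's default bucket size
BASE_MACHINES = 8      # reference scale for the default bucket size

def scaled_bucket_mib(num_machines: int) -> int:
    # 25MiB at 8 machines, 50MiB at 16, 100MiB at 32: shards stay the same size.
    return BASE_BUCKET_MIB * num_machines // BASE_MACHINES

num_machines = 32  # hypothetical job size
bucket_mib = scaled_bucket_mib(num_machines)
os.environ["FAIRRING_SLICE_SIZE"] = str(bucket_mib * 1024 * 1024)
# ...and pass bucket_cap_mb=bucket_mib when constructing DistributedDataParallel.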

Team

The Fairring project was started by Anjali Sridhar and is now a collaboration with Luca Wehrstedt and Shen Li.

License

Fairring is BSD licensed, as found in the LICENSE.txt file.
