NCCL

Optimized primitives for collective multi-GPU communication.

Introduction

NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVSwitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

For more information on NCCL usage, please refer to the NCCL documentation.

Build

Note: the official and tested builds of NCCL can be downloaded from: https://developer.nvidia.com/nccl. You can skip the following build steps if you choose to use the official builds.
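
For example, on Ubuntu the official builds are typically installed as packages; the package names below assume NVIDIA's CUDA apt repository is already configured on the system:

$ sudo apt install libnccl2 libnccl-dev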

To build the library:

$ cd nccl
$ make -j src.build

If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with:

$ make src.build CUDA_HOME=<path to cuda install>

NCCL will be compiled and installed in build/ unless BUILDDIR is set.
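
For example, to build into a custom directory instead (the path below is illustrative):

$ make -j src.build BUILDDIR=/opt/nccl/build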

By default, NCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining NVCC_GENCODE (defined in makefiles/common.mk) to only include the architecture of the target platform:

$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"
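
NVCC_GENCODE accepts several -gencode pairs to target more than one architecture in a single build; for example, for both Volta and Ampere (the pair of architectures shown is illustrative):

$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80"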

Install

To install NCCL on the system, create a package, then install it as root.

Debian/Ubuntu:

$ # Install tools to create debian packages
$ sudo apt install build-essential devscripts debhelper fakeroot
$ # Build NCCL deb package
$ make pkg.debian.build
$ ls build/pkg/deb/

RedHat/CentOS:

$ # Install tools to create rpm packages
$ sudo yum install rpm-build rpmdevtools
$ # Build NCCL rpm package
$ make pkg.redhat.build
$ ls build/pkg/rpm/

OS-agnostic tarball:

$ make pkg.txz.build
$ ls build/pkg/txz/
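
The tarball can then be unpacked on the target system; a minimal sketch, assuming installation under /usr/local (the exact archive name varies with the NCCL version and target architecture):

$ sudo tar xf build/pkg/txz/nccl_*.txz -C /usr/local --strip-components=1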

Tests

Tests for NCCL are maintained separately at https://github.com/nvidia/nccl-tests.

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make
$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus>
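
nccl-tests can also be built with MPI support to run one process per GPU, including across multiple nodes; a sketch assuming an MPI installation at <mpi path> (see the nccl-tests README for the exact options):

$ make MPI=1 MPI_HOME=<mpi path>
$ mpirun -np <nranks> ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1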

Copyright

All source code and accompanying documentation is copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.

Comments
  • NCCL 2.6.4 makes the system hang


    Linux: Ubuntu 20.04 LTS. GPU driver: latest NVIDIA driver for Linux. CUDA 10.1, cuDNN 7.6.5, NCCL 2.6.4. Hardware: CPU: Intel 9400F; MB: Z370; RAM: 64 GB dual-channel; GPU: two 2080 Ti cards on two PCIe 3.0 x8 slots, with an NVLink bridge between them.

    I ran all the nccl-tests and NCCL seems to be working. But while each test is running (about 30 minutes per test), the system freezes: I can't switch to the browser or do anything. I can move the mouse, but the system doesn't respond to mouse clicks or keyboard input. When a test finishes running, the system goes back to normal and the log prints to the console.

    log is here:

    #  ./all_reduce_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   3795 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   3795 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float     sum     7.18    0.00    0.00  0e+00     7.02    0.00    0.00  0e+00
              16             4   float     sum     7.00    0.00    0.00  0e+00     7.02    0.00    0.00  0e+00
              32             8   float     sum     7.28    0.00    0.00  0e+00     7.19    0.00    0.00  0e+00
              64            16   float     sum     7.20    0.01    0.01  0e+00     7.05    0.01    0.01  0e+00
             128            32   float     sum     7.30    0.02    0.02  0e+00     7.19    0.02    0.02  0e+00
             256            64   float     sum     7.30    0.04    0.04  0e+00     7.20    0.04    0.04  0e+00
             512           128   float     sum     7.47    0.07    0.07  0e+00     7.12    0.07    0.07  0e+00
            1024           256   float     sum     8.14    0.13    0.13  0e+00     7.92    0.13    0.13  0e+00
            2048           512   float     sum     8.56    0.24    0.24  0e+00     8.43    0.24    0.24  0e+00
            4096          1024   float     sum     9.72    0.42    0.42  0e+00     9.49    0.43    0.43  0e+00
            8192          2048   float     sum    11.99    0.68    0.68  0e+00    11.92    0.69    0.69  0e+00
           16384          4096   float     sum    14.36    1.14    1.14  0e+00    14.21    1.15    1.15  0e+00
           32768          8192   float     sum    16.79    1.95    1.95  0e+00    16.64    1.97    1.97  0e+00
           65536         16384   float     sum    21.14    3.10    3.10  0e+00    20.55    3.19    3.19  0e+00
          131072         32768   float     sum    35.56    3.69    3.69  0e+00    35.43    3.70    3.70  0e+00
          262144         65536   float     sum    41.23    6.36    6.36  0e+00    41.21    6.36    6.36  0e+00
          524288        131072   float     sum    50.66   10.35   10.35  0e+00    50.82   10.32   10.32  0e+00
         1048576        262144   float     sum    72.54   14.45   14.45  0e+00    72.45   14.47   14.47  0e+00
         2097152        524288   float     sum    120.7   17.37   17.37  0e+00    118.4   17.71   17.71  0e+00
         4194304       1048576   float     sum    215.2   19.49   19.49  0e+00    214.7   19.53   19.53  0e+00
         8388608       2097152   float     sum    411.3   20.39   20.39  0e+00    399.1   21.02   21.02  0e+00
        16777216       4194304   float     sum    865.3   19.39   19.39  0e+00    779.6   21.52   21.52  0e+00
        33554432       8388608   float     sum   1547.9   21.68   21.68  0e+00   1699.3   19.75   19.75  0e+00
        67108864      16777216   float     sum   3115.1   21.54   21.54  0e+00   3007.4   22.31   22.31  0e+00
       134217728      33554432   float     sum   5994.3   22.39   22.39  0e+00   5991.9   22.40   22.40  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 7.43886 
    
    ./all_gather_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   9119 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   9119 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                             out-of-place                       in-place          
    #       size         count    type     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             1   float     7.14    0.00    0.00  0e+00     7.06    0.00    0.00  0e+00
              16             2   float     7.03    0.00    0.00  0e+00     7.00    0.00    0.00  0e+00
              32             4   float     6.96    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
              64             8   float     7.10    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
             128            16   float     7.10    0.01    0.01  0e+00     7.14    0.01    0.01  0e+00
             256            32   float     7.18    0.02    0.02  0e+00     7.23    0.02    0.02  0e+00
             512            64   float     7.49    0.03    0.03  0e+00     7.47    0.03    0.03  0e+00
            1024           128   float     7.03    0.07    0.07  0e+00     6.96    0.07    0.07  0e+00
            2048           256   float     6.97    0.15    0.15  0e+00     6.97    0.15    0.15  0e+00
            4096           512   float     7.41    0.28    0.28  0e+00     7.00    0.29    0.29  0e+00
            8192          1024   float     9.59    0.43    0.43  0e+00     8.80    0.47    0.47  0e+00
           16384          2048   float    11.41    0.72    0.72  0e+00    10.78    0.76    0.76  0e+00
           32768          4096   float    13.39    1.22    1.22  0e+00    11.85    1.38    1.38  0e+00
           65536          8192   float    16.57    1.98    1.98  0e+00    13.83    2.37    2.37  0e+00
          131072         16384   float    23.07    2.84    2.84  0e+00    18.39    3.56    3.56  0e+00
          262144         32768   float    31.38    4.18    4.18  0e+00    30.27    4.33    4.33  0e+00
          524288         65536   float    36.00    7.28    7.28  0e+00    35.30    7.43    7.43  0e+00
         1048576        131072   float    47.38   11.06   11.06  0e+00    46.84   11.19   11.19  0e+00
         2097152        262144   float    70.44   14.89   14.89  0e+00    69.77   15.03   15.03  0e+00
         4194304        524288   float    120.1   17.46   17.46  0e+00    115.5   18.16   18.16  0e+00
         8388608       1048576   float    212.5   19.73   19.73  0e+00    210.2   19.95   19.95  0e+00
        16777216       2097152   float    418.5   20.05   20.05  0e+00    414.0   20.26   20.26  0e+00
        33554432       4194304   float    817.8   20.51   20.51  0e+00    785.1   21.37   21.37  0e+00
        67108864       8388608   float   1568.3   21.40   21.40  0e+00   1560.9   21.50   21.50  0e+00
       134217728      16777216   float   3298.6   20.34   20.34  0e+00   3070.3   21.86   21.86  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 6.6972 
    
    ./broadcast_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid  26256 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid  26256 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type    root     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float       0     7.24    0.00    0.00  0e+00     7.50    0.00    0.00  0e+00
              16             4   float       0     8.31    0.00    0.00  0e+00     7.69    0.00    0.00  0e+00
              32             8   float       0     8.15    0.00    0.00  0e+00     8.23    0.00    0.00  0e+00
              64            16   float       0     7.19    0.01    0.01  0e+00     7.13    0.01    0.01  0e+00
             128            32   float       0     7.25    0.02    0.02  0e+00     7.45    0.02    0.02  0e+00
             256            64   float       0     7.08    0.04    0.04  0e+00     7.16    0.04    0.04  0e+00
             512           128   float       0     7.47    0.07    0.07  0e+00     7.39    0.07    0.07  0e+00
            1024           256   float       0     7.19    0.14    0.14  0e+00    32.19    0.03    0.03  0e+00
            2048           512   float       0     7.36    0.28    0.28  0e+00     7.03    0.29    0.29  0e+00
            4096          1024   float       0     7.25    0.57    0.57  0e+00     7.07    0.58    0.58  0e+00
            8192          2048   float       0     9.11    0.90    0.90  0e+00     8.10    1.01    1.01  0e+00
           16384          4096   float       0    10.97    1.49    1.49  0e+00    10.52    1.56    1.56  0e+00
           32768          8192   float       0    13.36    2.45    2.45  0e+00    11.73    2.79    2.79  0e+00
           65536         16384   float       0    17.03    3.85    3.85  0e+00    14.24    4.60    4.60  0e+00
          131072         32768   float       0    22.66    5.78    5.78  0e+00    22.60    5.80    5.80  0e+00
          262144         65536   float       0    28.48    9.21    9.21  0e+00    28.45    9.21    9.21  0e+00
          524288        131072   float       0    40.26   13.02   13.02  0e+00    40.08   13.08   13.08  0e+00
         1048576        262144   float       0    63.48   16.52   16.52  0e+00    63.19   16.59   16.59  0e+00
         2097152        524288   float       0    110.1   19.04   19.04  0e+00    109.3   19.19   19.19  0e+00
         4194304       1048576   float       0    205.7   20.39   20.39  0e+00    237.1   17.69   17.69  0e+00
         8388608       2097152   float       0    425.1   19.73   19.73  0e+00    386.7   21.69   21.69  0e+00
        16777216       4194304   float       0    815.0   20.59   20.59  0e+00    824.0   20.36   20.36  0e+00
        33554432       8388608   float       0   1536.8   21.83   21.83  0e+00   1508.2   22.25   22.25  0e+00
        67108864      16777216   float       0   3139.2   21.38   21.38  0e+00   3124.3   21.48   21.48  0e+00
       134217728      33554432   float       0   6283.5   21.36   21.36  0e+00   5873.1   22.85   22.85  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 7.99748 
    
    $ ./reduce_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   4810 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   4810 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop    root     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float     sum       0     7.16    0.00    0.00  0e+00     7.35    0.00    0.00  0e+00
              16             4   float     sum       0     7.74    0.00    0.00  0e+00     7.67    0.00    0.00  0e+00
              32             8   float     sum       0     7.08    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
              64            16   float     sum       0     7.13    0.01    0.01  0e+00     7.14    0.01    0.01  0e+00
             128            32   float     sum       0     7.15    0.02    0.02  0e+00     7.06    0.02    0.02  0e+00
             256            64   float     sum       0     7.14    0.04    0.04  0e+00     7.12    0.04    0.04  0e+00
             512           128   float     sum       0     7.14    0.07    0.07  0e+00     7.11    0.07    0.07  0e+00
            1024           256   float     sum       0     7.09    0.14    0.14  0e+00     7.09    0.14    0.14  0e+00
            2048           512   float     sum       0     7.11    0.29    0.29  0e+00     7.12    0.29    0.29  0e+00
            4096          1024   float     sum       0     7.28    0.56    0.56  0e+00     7.20    0.57    0.57  0e+00
            8192          2048   float     sum       0     8.72    0.94    0.94  0e+00     8.59    0.95    0.95  0e+00
           16384          4096   float     sum       0    10.80    1.52    1.52  0e+00    10.78    1.52    1.52  0e+00
           32768          8192   float     sum       0    12.89    2.54    2.54  0e+00    12.64    2.59    2.59  0e+00
           65536         16384   float     sum       0    16.42    3.99    3.99  0e+00    15.88    4.13    4.13  0e+00
          131072         32768   float     sum       0    23.17    5.66    5.66  0e+00    23.27    5.63    5.63  0e+00
          262144         65536   float     sum       0    29.13    9.00    9.00  0e+00    28.88    9.08    9.08  0e+00
          524288        131072   float     sum       0    40.93   12.81   12.81  0e+00    40.93   12.81   12.81  0e+00
         1048576        262144   float     sum       0    64.30   16.31   16.31  0e+00    64.25   16.32   16.32  0e+00
         2097152        524288   float     sum       0    110.5   18.98   18.98  0e+00    110.6   18.97   18.97  0e+00
         4194304       1048576   float     sum       0    202.1   20.76   20.76  0e+00    202.1   20.76   20.76  0e+00
         8388608       2097152   float     sum       0    386.5   21.70   21.70  0e+00    386.3   21.71   21.71  0e+00
        16777216       4194304   float     sum       0    752.6   22.29   22.29  0e+00    752.5   22.30   22.30  0e+00
        33554432       8388608   float     sum       0   1485.2   22.59   22.59  0e+00   1529.3   21.94   21.94  0e+00
        67108864      16777216   float     sum       0   2947.4   22.77   22.77  0e+00   2945.2   22.79   22.79  0e+00
       134217728      33554432   float     sum       0   5873.8   22.85   22.85  0e+00   5873.8   22.85   22.85  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 8.22671 
    $ ./reduce_scatter_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   5435 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   5435 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             1   float     sum     7.21    0.00    0.00  0e+00     7.28    0.00    0.00  0e+00
              16             2   float     sum     7.12    0.00    0.00  0e+00     7.18    0.00    0.00  0e+00
              32             4   float     sum     7.14    0.00    0.00  0e+00     7.22    0.00    0.00  0e+00
              64             8   float     sum     7.20    0.00    0.00  0e+00     7.15    0.00    0.00  0e+00
             128            16   float     sum     7.14    0.01    0.01  0e+00     7.12    0.01    0.01  0e+00
             256            32   float     sum     7.16    0.02    0.02  0e+00     7.12    0.02    0.02  0e+00
             512            64   float     sum     7.18    0.04    0.04  0e+00     7.12    0.04    0.04  0e+00
            1024           128   float     sum     7.53    0.07    0.07  0e+00     7.27    0.07    0.07  0e+00
            2048           256   float     sum     7.28    0.14    0.14  0e+00     7.23    0.14    0.14  0e+00
            4096           512   float     sum     7.64    0.27    0.27  0e+00     7.57    0.27    0.27  0e+00
            8192          1024   float     sum     9.35    0.44    0.44  0e+00     9.24    0.44    0.44  0e+00
           16384          2048   float     sum    11.33    0.72    0.72  0e+00    11.23    0.73    0.73  0e+00
           32768          4096   float     sum    12.66    1.29    1.29  0e+00    12.62    1.30    1.30  0e+00
           65536          8192   float     sum    15.39    2.13    2.13  0e+00    15.31    2.14    2.14  0e+00
          131072         16384   float     sum    21.02    3.12    3.12  0e+00    21.35    3.07    3.07  0e+00
          262144         32768   float     sum    32.36    4.05    4.05  0e+00    31.98    4.10    4.10  0e+00
          524288         65536   float     sum    39.63    6.61    6.61  0e+00    39.76    6.59    6.59  0e+00
         1048576        131072   float     sum    57.11    9.18    9.18  0e+00    56.88    9.22    9.22  0e+00
         2097152        262144   float     sum    92.96   11.28   11.28  0e+00    92.54   11.33   11.33  0e+00
         4194304        524288   float     sum    166.4   12.60   12.60  0e+00    165.9   12.64   12.64  0e+00
         8388608       1048576   float     sum    308.5   13.59   13.59  0e+00    504.4    8.32    8.32  0e+00
        16777216       2097152   float     sum   1050.1    7.99    7.99  0e+00    693.5   12.10   12.10  0e+00
        33554432       4194304   float     sum   1533.4   10.94   10.94  0e+00   1414.8   11.86   11.86  0e+00
        67108864       8388608   float     sum   2529.2   13.27   13.27  0e+00   2314.2   14.50   14.50  0e+00
       134217728      16777216   float     sum   5619.2   11.94   11.94  0e+00   4905.4   13.68   13.68  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 4.44552 
    

    Originally I found this issue while training with TensorFlow. I first submitted a bug to TensorFlow; here is the link: https://github.com/tensorflow/tensorflow/issues/40027

    It shows that when I remove the NVLink bridge, the TF code runs well, and when I use the NVLink bridge without NCCL, the TF code also runs well. But when I use NCCL together with the NVLink bridge, the system halts and I have to reboot.

    opened by AlexWang1900 37
  • NCCL InfiniBand performance


    Hi NCCL devs! I have two machines in a cluster communicating over InfiniBand. There is 400 Gb/sec of bandwidth available between the machines (confirmed with ib_send_bw), but:

    1. nccl-tests only achieves about 20 GB/s, roughly half of what I would expect
    2. there is a decent amount of variance

    running broadcast_perf on 2 machines:

    NCCL_DEBUG=INFO mpiexec -f <hosts file> /root/code/nccl-tests/build/broadcast_perf -b 1M -e 2048M -f 2 -g 1 -c 0 -d half
    

    [attached log: nccl.txt]

    This log shows that (1) nccl is getting between about 15 and 20 GB/s in busbw, and (2) the speed isn't monotonic for larger amounts of data and can change significantly across runs.

    Any ideas on what could be going wrong here? I would expect that I should be getting something closer to 45 GB/s and that there would be more consistency across runs.

    env vars:

    NCCL_IB_HCA=^mlx5_2
    NCCL_SOCKET_IFNAME=eth
    

    ibstatus

    Infiniband device 'mlx5_0' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d4:fe72
            base lid:        0x2ed
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_1' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d4:fe74
            base lid:        0x5b3
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_2' port 1 status:
            default gid:     0000:0000:0000:0000:0000:0000:0000:0000
            base lid:        0x0
            sm lid:          0x0
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      Ethernet
    
    Infiniband device 'mlx5_3' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d5:04c6
            base lid:        0x2f3
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_4' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d5:04c8
            base lid:        0x679
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    opened by christopherhesse 34
  • AllReduce hangs


    My problem was diagnosed in https://github.com/tensorflow/tensorflow/issues/32654 - please find all the info about my environment there.

    Using the master version of NCCL. I launch all_reduce_perf and it hangs with 100% volatile GPU usage reported.

    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4
    # nThread 1 nGpus 4 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid  15833 on jupyter-vmarkovtsev device  0 [0x02] GeForce GTX 1080 Ti
    #   Rank  1 Pid  15833 on jupyter-vmarkovtsev device  1 [0x03] GeForce GTX 1080 Ti
    #   Rank  2 Pid  15833 on jupyter-vmarkovtsev device  2 [0x82] GeForce GTX 1080 Ti
    #   Rank  3 Pid  15833 on jupyter-vmarkovtsev device  3 [0x83] GeForce GTX 1080 Ti
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Bootstrap : Using [0]eth0:10.2.3.32<0>
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    
    jupyter-vmarkovtsev:15833:15833 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Socket : Using [0]eth0:10.2.3.32<0>
    NCCL version 2.4.8+cuda10.0
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO nranks 4
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
    jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Channel 00 :    0   1   2   3
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
    jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
    jupyter-vmarkovtsev:15833:15833 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via direct shared memory
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Launch mode Group/CGMD
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    
    jupyter-vmarkovtsev:15833:15833 [0] init.cc:1250 NCCL WARN Mismatched collective detected, please check your collective calls at and around rank 3. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs
    

    I waited for 10 minutes; no more logs were printed.

    opened by vmarkovtsev 31
  • NCCL INFO NET/Plugin : No plugin found (libnccl-net.so)


    We got stuck using the Clara SDK Docker image on Kubeflow with multi-GPU training (/commands/train_2gpu.sh). It just hangs. Not sure if it's a plugin-not-found issue or a hardware config issue on our side. We are using a DGX-1 with Kubernetes/Kubeflow. Please help.

    Requested train epochs: 10; iterations: 158
    2020-06-29 22:20:20.310128: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
    Requested train epochs: 10; iterations: 158
    2020-06-29 22:20:24.223690: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
    2020-06-29 22:20:24.816974: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
    ds-ml-01-0:17085:17354 [0] NCCL INFO Bootstrap : Using [0]eth0:10.233.66.147<0>
    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/IB : No device found.
    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/Socket : Using [0]eth0:10.233.66.147<0>
    NCCL version 2.4.8+cuda10.1
    ds-ml-01-0:17086:17353 [1] NCCL INFO Bootstrap : Using [0]eth0:10.233.66.147<0>
    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/IB : No device found.
    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/Socket : Using [0]eth0:10.233.66.147<0>
    ds-ml-01-0:17085:17354 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
    ds-ml-01-0:17086:17353 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff

    opened by lalithvaka 28
  • Suboptimal performance with TCP over high bandwidth networks


    Hi! Many thanks for creating a great framework. NCCL is widely used at our org for scaling the training of ML models and has proved very reliable.

    I am currently trying to figure out how to achieve optimal inter-node performance with NCCL running over TCP on high-bandwidth networks (32 Gbps, 100 Gbps, and higher). Even with large message sizes we have not been able to reliably obtain more than 60% of wire speed over 32 Gbps networks (see below for nccl-tests output). From what I've gathered NCCL just hasn't been fully optimized for this configuration yet (although I'm still holding out some hope that I'm just doing it wrong 😄).

    I'm prepared to work full time for several weeks on lifting any limitations in the current implementation, but I could use a few pointers for getting started. Do you have a sense for what the most promising changes might be and how to incorporate them into the codebase? One thing I might want to explore is using multiple threads/TCP streams. But there is still scope to better utilize a single TCP stream as well, so maybe there are some simpler optimizations to try first?

    I've been looking into the codebase and there's a number of things that I don't really understand yet:

    • Running nccl-tests all_reduce_perf -w 0 -n 1 seems to spawn a total of 4 allreduce ops according to my TRACE output. I would have expected just 2 (one for in-place, one for out-of-place).
    • I'm not super clear on the control flow/threading model. In my tests NCCL is using exactly two cores; some of the main files of interest seem to be net_socket.cc, net.cc, socket.h, and enqueue.cc, and a lot of cycles are spent polling ncclSocketIrecv/ncclSocketIsend, but I'm still struggling with how everything fits together and exactly where/how the actual network transfers happen.

    Some more details on my setup. My current config consists of two GCE machines with 8x V100, plenty of cores/RAM, and a 32 Gbps network (no RDMA). I get about 28 Gbps bidirectional bandwidth by running one iperf3 server and client on each node (and >30 Gbps with -Z -P4 flags). Anecdotally, more complex setups that include Horovod have occasionally been able to hit 60% of wire speed on 32 Gbps and 50 Gbps networks. In this case, running nccl-tests only yields 16 Gbps:

    $ mpirun --allow-run-as-root -H 10.73.0.52:1,10.73.0.15:1 -np 2 -mca btl_tcp_if_include ens12 -x NCCL_IB_DISABLE=1 -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=ens12 -x NCCL_DEBUG=INFO /nccl-tests/build/all_reduce_perf -b 1G -e 1G -f 2 -g 1 -c 0
    # nThread 1 nGpus 1 minBytes 1073741824 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
    #
    # Using devices
    #   Rank  0 Pid     51 on managed-worker-l83z device  0 [0x00] Tesla V100-SXM2-16GB
    #   Rank  1 Pid     73 on managed-worker-jbk7 device  0 [0x00] Tesla V100-SXM2-16GB
    managed-worker-l83z:51:51 [0] NCCL INFO NET/Socket : Using [0]ens12:10.73.0.52<0>
    managed-worker-l83z:51:51 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    managed-worker-l83z:51:51 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
    NCCL version 2.4.2+cuda10.0
    managed-worker-jbk7:73:73 [0] NCCL INFO NET/Socket : Using [0]ens12:10.73.0.15<0>
    managed-worker-jbk7:73:73 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    managed-worker-jbk7:73:73 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
    managed-worker-l83z:51:57 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
    managed-worker-l83z:51:57 [0] NCCL INFO comm 0x7fd518002560 rank 0 nranks 2 cudaDev 0 nvmlDev 0
    managed-worker-jbk7:73:78 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
    managed-worker-jbk7:73:78 [0] NCCL INFO comm 0x7f9be0002560 rank 1 nranks 2 cudaDev 0 nvmlDev 0
    managed-worker-l83z:51:57 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
    managed-worker-jbk7:73:78 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
    managed-worker-l83z:51:57 [0] NCCL INFO Channel 00 :    0   1
    managed-worker-l83z:51:57 [0] NCCL INFO Channel 01 :    0   1
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 01 : 1 -> 0 [receive] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 01 : 0 -> 1 [receive] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 01 : 0 -> 1 [send] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
    managed-worker-l83z:51:57 [0] NCCL INFO comm 0x7fd518002560 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
    managed-worker-jbk7:73:78 [0] NCCL INFO comm 0x7f9be0002560 rank 1 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    managed-worker-l83z:51:51 [0] NCCL INFO Launch mode Parallel
      1073741824     268435456   float     sum   539383    1.99    1.99    N/A   553087    1.94    1.94    N/A
    managed-worker-l83z:51:51 [0] NCCL INFO Destroyed comm 0x7fd518002560 rank 0
    managed-worker-jbk7:73:73 [0] NCCL INFO Destroyed comm 0x7f9be0002560 rank 1
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 1.96602
    #
    
    opened by cswinter 27
  • NCCL doesn't print the right log about the connection when GDR is enabled


    Environment

    • NCCL version 2.5.7+cuda10.0
    • 8 * V100-PCIe per node, a total of 2 nodes

    test command:

    mpirun -np 16 --hostfile ../../hostfile.txt -bind-to none -map-by slot --display-map --mca pml ob1 --mca btl_vader_single_copy_mechanism none --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 --mca btl_tcp_if_exclude lo,docker0 --mca orte_base_help_aggregate 0 --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 --mca btl openib,self,vader -x NCCL_SOCKET_IFNAME=^lo,docker0 -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/tmp/debug.log.%h.%p -x NCCL_IB_HCA=mlx5_0:1 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_READ=0 ./all_reduce_perf -b 8 -e 128M -f 2
    

    Question: when I switched the env var NCCL_NET_GDR_READ from 0 to 1, nccl-tests showed much slower latency. When NCCL_NET_GDR_READ was 0, the nccl-tests output was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    38.87    0.00    0.00  2e-07    36.96    0.00    0.00  2e-07
              16             4   float     sum    36.45    0.00    0.00  2e-07    36.66    0.00    0.00  1e-07
              32             8   float     sum    36.74    0.00    0.00  1e-07    36.71    0.00    0.00  1e-07
              64            16   float     sum    37.62    0.00    0.00  1e-07    37.03    0.00    0.00  1e-07
             128            32   float     sum    38.05    0.00    0.01  1e-07    38.00    0.00    0.01  1e-07
             256            64   float     sum    38.31    0.01    0.01  6e-08    38.73    0.01    0.01  6e-08
             512           128   float     sum    39.79    0.01    0.02  6e-08    39.00    0.01    0.02  6e-08
            1024           256   float     sum    40.40    0.03    0.05  2e-07    39.96    0.03    0.05  2e-07
            2048           512   float     sum    42.57    0.05    0.09  2e-07    42.42    0.05    0.09  2e-07
            4096          1024   float     sum    73.62    0.06    0.10  5e-07    72.72    0.06    0.11  5e-07
            8192          2048   float     sum    81.68    0.10    0.19  5e-07    80.06    0.10    0.19  5e-07
           16384          4096   float     sum    84.74    0.19    0.36  5e-07    83.30    0.20    0.37  5e-07
           32768          8192   float     sum    90.39    0.36    0.68  5e-07    90.26    0.36    0.68  5e-07
           65536         16384   float     sum    104.2    0.63    1.18  5e-07    102.9    0.64    1.19  5e-07
          131072         32768   float     sum    120.0    1.09    2.05  5e-07    118.6    1.11    2.07  5e-07
          262144         65536   float     sum    218.7    1.20    2.25  5e-07    221.3    1.18    2.22  5e-07
          524288        131072   float     sum    356.1    1.47    2.76  5e-07    355.5    1.47    2.77  5e-07
         1048576        262144   float     sum    479.5    2.19    4.10  5e-07    483.1    2.17    4.07  5e-07
         2097152        524288   float     sum    765.7    2.74    5.14  5e-07    764.2    2.74    5.15  5e-07
         4194304       1048576   float     sum   1428.6    2.94    5.50  5e-07   1425.0    2.94    5.52  5e-07
         8388608       2097152   float     sum   2776.9    3.02    5.66  5e-07   2764.4    3.03    5.69  5e-07
        16777216       4194304   float     sum   5475.1    3.06    5.75  5e-07   5490.5    3.06    5.73  5e-07
        33554432       8388608   float     sum    10886    3.08    5.78  5e-07    10876    3.09    5.78  5e-07
        67108864      16777216   float     sum    37080    1.81    3.39  5e-07    75304    0.89    1.67  5e-07
       134217728      33554432   float     sum    72090    1.86    3.49  5e-07    57255    2.34    4.40  5e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 1.92724
    

    but when NCCL_NET_GDR_READ was 1, the nccl-tests output was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    43.22    0.00    0.00  2e-07    37.00    0.00    0.00  2e-07
              16             4   float     sum    37.34    0.00    0.00  2e-07    37.79    0.00    0.00  1e-07
              32             8   float     sum    37.33    0.00    0.00  1e-07    37.20    0.00    0.00  1e-07
              64            16   float     sum    37.89    0.00    0.00  1e-07    37.73    0.00    0.00  1e-07
             128            32   float     sum    38.61    0.00    0.01  1e-07    38.53    0.00    0.01  1e-07
             256            64   float     sum    43.42    0.01    0.01  6e-08    39.17    0.01    0.01  6e-08
             512           128   float     sum    40.46    0.01    0.02  6e-08    40.32    0.01    0.02  6e-08
            1024           256   float     sum    40.59    0.03    0.05  2e-07    40.28    0.03    0.05  2e-07
            2048           512   float     sum    43.55    0.05    0.09  2e-07    43.05    0.05    0.09  2e-07
            4096          1024   float     sum    73.49    0.06    0.10  5e-07    70.96    0.06    0.11  5e-07
            8192          2048   float     sum    79.89    0.10    0.19  5e-07    79.86    0.10    0.19  5e-07
           16384          4096   float     sum    84.63    0.19    0.36  5e-07    83.82    0.20    0.37  5e-07
           32768          8192   float     sum    93.38    0.35    0.66  5e-07    91.32    0.36    0.67  5e-07
           65536         16384   float     sum    107.4    0.61    1.14  5e-07    104.1    0.63    1.18  5e-07
          131072         32768   float     sum    122.9    1.07    2.00  5e-07    121.7    1.08    2.02  5e-07
          262144         65536   float     sum    225.9    1.16    2.18  5e-07    226.2    1.16    2.17  5e-07
          524288        131072   float     sum    346.8    1.51    2.83  5e-07    345.5    1.52    2.85  5e-07
         1048576        262144   float     sum    428.7    2.45    4.59  5e-07    430.0    2.44    4.57  5e-07
         2097152        524288   float     sum    576.1    3.64    6.83  5e-07    580.9    3.61    6.77  5e-07
         4194304       1048576   float     sum    927.3    4.52    8.48  5e-07    926.1    4.53    8.49  5e-07
         8388608       2097152   float     sum   1678.7    5.00    9.37  5e-07   1683.0    4.98    9.35  5e-07
        16777216       4194304   float     sum   3393.2    4.94    9.27  5e-07   3382.5    4.96    9.30  5e-07
        33554432       8388608   float     sum   7094.9    4.73    8.87  5e-07   7055.8    4.76    8.92  5e-07
        67108864      16777216   float     sum    16353    4.10    7.69  5e-07    16348    4.10    7.70  5e-07
       134217728      33554432   float     sum    32639    4.11    7.71  5e-07    32753    4.10    7.68  5e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 2.89958
    

    If I stop the nv_peer_mem service manually by running the command service nv_peer_mem stop, and then run the tests with NCCL_NET_GDR_READ=0, the result was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    39.78    0.00    0.00  2e-07    38.16    0.00    0.00  2e-07
              16             4   float     sum    37.00    0.00    0.00  2e-07    37.33    0.00    0.00  1e-07
              32             8   float     sum    37.30    0.00    0.00  1e-07    37.08    0.00    0.00  1e-07
              64            16   float     sum    38.21    0.00    0.00  2e-07    38.90    0.00    0.00  2e-07
             128            32   float     sum    38.55    0.00    0.01  2e-07    38.87    0.00    0.01  2e-07
             256            64   float     sum    39.50    0.01    0.01  2e-07    39.42    0.01    0.01  2e-07
             512           128   float     sum    40.47    0.01    0.02  2e-07    39.91    0.01    0.02  2e-07
            1024           256   float     sum    41.05    0.02    0.05  2e-07    41.08    0.02    0.05  2e-07
            2048           512   float     sum    44.04    0.05    0.09  2e-07    43.84    0.05    0.09  2e-07
            4096          1024   float     sum    48.00    0.09    0.16  2e-07    47.30    0.09    0.16  2e-07
            8192          2048   float     sum    52.58    0.16    0.29  2e-07    51.76    0.16    0.30  2e-07
           16384          4096   float     sum    65.36    0.25    0.47  2e-07    64.10    0.26    0.48  2e-07
           32768          8192   float     sum    90.61    0.36    0.68  2e-07    87.10    0.38    0.71  2e-07
           65536         16384   float     sum    133.1    0.49    0.92  2e-07    258.5    0.25    0.48  2e-07
          131072         32768   float     sum    283.5    0.46    0.87  5e-07    277.1    0.47    0.89  5e-07
          262144         65536   float     sum    307.3    0.85    1.60  5e-07    300.6    0.87    1.63  5e-07
          524288        131072   float     sum    350.6    1.50    2.80  5e-07    353.6    1.48    2.78  5e-07
         1048576        262144   float     sum    475.0    2.21    4.14  5e-07    474.2    2.21    4.15  5e-07
         2097152        524288   float     sum    766.7    2.74    5.13  5e-07    762.5    2.75    5.16  5e-07
         4194304       1048576   float     sum   1453.1    2.89    5.41  5e-07   1451.9    2.89    5.42  5e-07
         8388608       2097152   float     sum   2980.8    2.81    5.28  5e-07   2984.1    2.81    5.27  5e-07
        16777216       4194304   float     sum    71226    0.24    0.44  5e-07   5877.2    2.85    5.35  5e-07
        33554432       8388608   float     sum    12570    2.67    5.01  2e-07    12543    2.68    5.02  2e-07
        67108864      16777216   float     sum    97148    0.69    1.30  2e-07    25695    2.61    4.90  2e-07
       134217728      33554432   float     sum    97671    1.37    2.58  2e-07    69526    1.93    3.62  2e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 1.67461
    

    So this indicates that GDR did take effect.

    But the NCCL debug log is always [0] NCCL INFO Ring 00 : 15[41000] -> 0[1b000] [receive] via NET/IB/0, with nothing indicating that GDR is in use.

    opened by weberxie 25
  • GPU occupation during model training


    Hi,

    Do you have any profiling results on GPU utilization during training?

    I found that NCCL communication overhead reached 75%. Is that normal?

    [screenshot: check_gpu_utilization]

    Thanks

    opened by elevenxiang 23
  • peer mapping resources exhausted for < 8 GPUs


    I am running an NCCL reduction across multiple GPUs on an Amazon P2 16x instance in a multi-process context (one MPI rank per GPU). When I added small arrays together across 16 workers, I got the error "peer mapping resources exhausted". Looking online, I determined that perhaps I was limited to 8 GPUs in a group and NCCL wasn't dealing with this limitation internally.

    However, when I reduced between two groups of 8 GPUs using NCCL (by splitting MPI_COMM_WORLD into two separate communicators) and then did a standard MPI reduction in host memory to reduce the remaining two arrays, I got the same error. Same for 7 GPUs. I had to reduce the group size to 4 to get the correct behaviour.

    It seems this is unrelated to the peer ensemble limitation but instead is related to other resources needed for multi-process reductions on a single node.

    Joss Knight

    opened by extabgrad 23
  • NCCL segfaults on single node with 10 GPUs


    I was attempting to use distributed TensorFlow when I noticed I could not add the 10th GPU on my node to a distributed strategy... After running nccl-tests, I noticed it appears to be an issue with NCCL.

    $ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
    # nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid 226099 on node05-ccncluster device  0 [0x1a] TITAN Xp
    #   Rank  1 Pid 226099 on node05-ccncluster device  1 [0x1b] TITAN Xp
    #   Rank  2 Pid 226099 on node05-ccncluster device  2 [0x1c] TITAN Xp
    #   Rank  3 Pid 226099 on node05-ccncluster device  3 [0x1d] TITAN Xp
    #   Rank  4 Pid 226099 on node05-ccncluster device  4 [0x1e] TITAN Xp
    #   Rank  5 Pid 226099 on node05-ccncluster device  5 [0x3d] TITAN Xp
    #   Rank  6 Pid 226099 on node05-ccncluster device  6 [0x3e] TITAN Xp
    #   Rank  7 Pid 226099 on node05-ccncluster device  7 [0x3f] TITAN Xp
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    42.86    0.00    0.00  1e-07    42.51    0.00    0.00  1e-07
              16             4   float     sum    42.46    0.00    0.00  1e-07    43.06    0.00    0.00  1e-07
              32             8   float     sum    42.90    0.00    0.00  6e-08    42.75    0.00    0.00  6e-08
              64            16   float     sum    42.81    0.00    0.00  6e-08    43.06    0.00    0.00  6e-08
             128            32   float     sum    42.81    0.00    0.01  6e-08    42.92    0.00    0.01  6e-08
             256            64   float     sum    43.05    0.01    0.01  3e-08    43.34    0.01    0.01  3e-08
             512           128   float     sum    42.79    0.01    0.02  3e-08    42.65    0.01    0.02  3e-08
            1024           256   float     sum    42.91    0.02    0.04  1e-07    43.00    0.02    0.04  1e-07
            2048           512   float     sum    43.35    0.05    0.08  2e-07    43.25    0.05    0.08  2e-07
            4096          1024   float     sum    43.46    0.09    0.16  2e-07    43.40    0.09    0.17  2e-07
            8192          2048   float     sum    44.38    0.18    0.32  2e-07    43.88    0.19    0.33  2e-07
           16384          4096   float     sum    49.15    0.33    0.58  2e-07    48.86    0.34    0.59  2e-07
           32768          8192   float     sum    72.44    0.45    0.79  2e-07    71.88    0.46    0.80  2e-07
           65536         16384   float     sum    120.5    0.54    0.95  2e-07    121.7    0.54    0.94  2e-07
          131072         32768   float     sum    129.5    1.01    1.77  2e-07    129.5    1.01    1.77  2e-07
          262144         65536   float     sum    157.1    1.67    2.92  2e-07    157.0    1.67    2.92  2e-07
          524288        131072   float     sum    205.4    2.55    4.47  2e-07    205.3    2.55    4.47  2e-07
         1048576        262144   float     sum    305.1    3.44    6.01  2e-07    305.0    3.44    6.02  2e-07
         2097152        524288   float     sum    647.4    3.24    5.67  2e-07    495.1    4.24    7.41  2e-07
         4194304       1048576   float     sum    900.7    4.66    8.15  2e-07    898.9    4.67    8.17  2e-07
         8388608       2097152   float     sum   1735.0    4.83    8.46  2e-07   1718.9    4.88    8.54  2e-07
        16777216       4194304   float     sum   3425.8    4.90    8.57  2e-07   3406.6    4.92    8.62  2e-07
        33554432       8388608   float     sum   6793.3    4.94    8.64  2e-07   6792.5    4.94    8.64  2e-07
        67108864      16777216   float     sum    13579    4.94    8.65  2e-07    13574    4.94    8.65  2e-07
       134217728      33554432   float     sum    27135    4.95    8.66  2e-07    27134    4.95    8.66  2e-07
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 3.0361
    #
    $ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
    # nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid 226138 on node05-ccncluster device  0 [0x1a] TITAN Xp
    #   Rank  1 Pid 226138 on node05-ccncluster device  1 [0x1b] TITAN Xp
    #   Rank  2 Pid 226138 on node05-ccncluster device  2 [0x1c] TITAN Xp
    #   Rank  3 Pid 226138 on node05-ccncluster device  3 [0x1d] TITAN Xp
    #   Rank  4 Pid 226138 on node05-ccncluster device  4 [0x1e] TITAN Xp
    #   Rank  5 Pid 226138 on node05-ccncluster device  5 [0x3d] TITAN Xp
    #   Rank  6 Pid 226138 on node05-ccncluster device  6 [0x3e] TITAN Xp
    #   Rank  7 Pid 226138 on node05-ccncluster device  7 [0x3f] TITAN Xp
    #   Rank  8 Pid 226138 on node05-ccncluster device  8 [0x40] TITAN Xp
    #   Rank  9 Pid 226138 on node05-ccncluster device  9 [0x41] TITAN Xp
    Segmentation fault (core dumped)
    
    opened by mjlbach 22
  • InfiniBand regression in NCCL 2.5.6-2

    Infiniband regression in NCCL 2.5.6-2

    I have an application that uses NCCL via PyTorch 1.4 compiled from source with NCCL 2.5.6-2 and crashes early with this error message on all/most ranks:

    NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.5.6

    Some of the ranks are printing out this error message:

    NCCL WARN Call to ibv_modify_qp failed with error No such device

    The same code runs fine on an image that is identical except that it was built with NCCL 2.4.8-1. However, we are unable to use that version of NCCL for larger workloads, because our nodes have multiple Mellanox devices and we invariably run into https://github.com/NVIDIA/nccl/issues/179.

    Additional datapoints:

    • After startup, the application performs one allreduce operation across all ranks to detect issues; it appears to complete successfully despite the warnings.
    • nccl-tests runs successfully and without printing any warnings.

    As a temporary workaround, I would also be interested in backporting the fix for https://github.com/NVIDIA/nccl/issues/179 to NCCL 2.4.8-1, but the referenced commit seems to incorporate a number of different changes. Is there a more minimal diff I could apply to get the fix?
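    A minimal single-process, multi-GPU all-reduce smoke test like the sketch below (error handling elided; buffer contents are not validated) is a common way to separate NCCL-level failures from framework-level ones:

    /* smoke.c -- build with e.g.: nvcc smoke.c -lnccl -o smoke */
    #include <nccl.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)   /* elements per GPU */
    #define MAXDEV 16

    int main(void) {
      int nDev = 0;
      cudaGetDeviceCount(&nDev);
      if (nDev > MAXDEV) nDev = MAXDEV;

      int devs[MAXDEV];
      float* buf[MAXDEV];
      cudaStream_t streams[MAXDEV];
      ncclComm_t comms[MAXDEV];

      for (int i = 0; i < nDev; i++) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc((void**)&buf[i], N * sizeof(float));
        cudaMemset(buf[i], 0, N * sizeof(float));
        cudaStreamCreate(&streams[i]);
      }
      ncclCommInitAll(comms, nDev, devs);   /* one communicator per GPU */

      ncclGroupStart();                     /* one thread drives all ranks */
      for (int i = 0; i < nDev; i++)
        ncclAllReduce(buf[i], buf[i], N, ncclFloat, ncclSum, comms[i], streams[i]);
      ncclGroupEnd();

      for (int i = 0; i < nDev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
      }
      for (int i = 0; i < nDev; i++) ncclCommDestroy(comms[i]);
      printf("all-reduce completed on %d GPU(s)\n", nDev);
      return 0;
    }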

    opened by cswinter 20
  • NCCL Fortran bindings

    NCCL Fortran bindings

    Hi All,

    I've added Fortran bindings for ncclBcast, ncclAllGather and ncclReduce. I plan on adding ncclAllReduce and ncclReduceScatter very soon.

    I've also added tests/samples for using these with floats using both "Fortran Array" and "Pointer" syntax. The tests/samples for ncclAllGather and ncclReduce are for "out-of-place" operation. I plan on adding "in-place" tests/samples soon too.

    Cheers, Kyle
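    For orientation, these are the C prototypes that such Fortran bindings wrap (the NCCL 2.x declarations from nccl.h; older releases differed slightly, e.g. in the count types):

    ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype,
                           int root, ncclComm_t comm, cudaStream_t stream);
    ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
                               ncclDataType_t datatype, ncclComm_t comm,
                               cudaStream_t stream);
    ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count,
                            ncclDataType_t datatype, ncclRedOp_t op, int root,
                            ncclComm_t comm, cudaStream_t stream);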

    opened by kylefernandes 20
  • [Question] NCCL WARNS

    [Question] NCCL WARNS

    Hi, I'm DaeHyun. Happy New Year~

    I'm doing multi-node training using the PyTorch-based Detectron2, but frustratingly, training is much slower than on a single node!

    I use RTX A6000 GPUs, four per node, and the GPUs work normally on both nodes. The following is the output with NCCL_DEBUG=INFO set.

    cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
    cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    
    cvlab189-System-Product-Name:1379562:1379562 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
    cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
    cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO Using network Socket
    NCCL version 2.10.3+cuda11.3
    cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
    cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    
    cvlab189-System-Product-Name:1379564:1379564 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
    cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
    cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO Using network Socket
    cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
    cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    
    cvlab189-System-Product-Name:1379565:1379565 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
    cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
    cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
    cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO Using network Socket
    cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    
    cvlab189-System-Product-Name:1379563:1379563 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
    cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
    cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO Using network Socket
    cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->4
    cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
    cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
    cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 00 : 2[41000] -> 3[61000] via P2P/IPC
    cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 00 : 1[2c000] -> 2[41000] via P2P/IPC
    cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 01 : 2[41000] -> 3[61000] via P2P/IPC
    cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 01 : 1[2c000] -> 2[41000] via P2P/IPC
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 7[68000] -> 0[1000] [receive] via NET/Socket/0
    cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 00 : 3[61000] -> 4[19000] [send] via NET/Socket/0
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 7[68000] -> 0[1000] [receive] via NET/Socket/0
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2c000] via P2P/IPC
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 0[1000] -> 1[2c000] via P2P/IPC
    cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 01 : 3[61000] -> 4[19000] [send] via NET/Socket/0
    cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Connected all rings
    cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 00 : 3[61000] -> 2[41000] via P2P/IPC
    cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 01 : 3[61000] -> 2[41000] via P2P/IPC
    cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Connected all rings
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Connected all rings
    cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Connected all rings
    cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 00 : 2[41000] -> 1[2c000] via P2P/IPC
    cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 00 : 1[2c000] -> 0[1000] via P2P/IPC
    cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 01 : 2[41000] -> 1[2c000] via P2P/IPC
    cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 01 : 1[2c000] -> 0[1000] via P2P/IPC
    cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Connected all trees
    cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
    cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 4[19000] -> 0[1000] [receive] via NET/Socket/0
    cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Connected all trees
    cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
    cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 4[19000] -> 0[1000] [receive] via NET/Socket/0
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 0[1000] -> 4[19000] [send] via NET/Socket/0
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 0[1000] -> 4[19000] [send] via NET/Socket/0
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Connected all trees
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
    cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Connected all trees
    cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
    cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
    cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO comm 0x7f7590002fb0 rank 0 nranks 8 cudaDev 0 busId 1000 - Init COMPLETE
    cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO comm 0x7f3928002fb0 rank 1 nranks 8 cudaDev 1 busId 2c000 - Init COMPLETE
    cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO comm 0x7f9f00002fb0 rank 3 nranks 8 cudaDev 3 busId 61000 - Init COMPLETE
    cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO comm 0x7fe414002fb0 rank 2 nranks 8 cudaDev 2 busId 41000 - Init COMPLETE
    cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO Launch mode Parallel
    

    What should I do? I think the "No plugin found" message indicates the problem.
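    For what it's worth, the more telling line in this log may be the ibvwrap.cc warning: NCCL loads libibverbs dynamically and falls back to plain TCP sockets ("Using network Socket") when it cannot, which by itself can explain poor multi-node scaling. A quick standalone check (a sketch) of whether the library is loadable:

    /* ibcheck.c -- build with: cc ibcheck.c -ldl -o ibcheck */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void) {
      void* h = dlopen("libibverbs.so.1", RTLD_NOW);
      if (!h) {
        printf("not loadable: %s\n", dlerror());  /* NCCL will fall back to sockets */
        return 1;
      }
      printf("libibverbs.so.1 is loadable; the IB/RoCE transport is available\n");
      dlclose(h);
      return 0;
    }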

    opened by daebakk 0
  • Question: No threadfence_system in LLGenericOp in all_reduce ring implementation

    Question: No threadfence_system in LLGenericOp in all_reduce ring implementation

    I was trying to understand the code flow of all_reduce (when called from nccl_tests/all_reduce_perf) and wanted to confirm a few things. I ran the code on a Google Cloud instance with 2 A100 GPUs. The AllReduce kernel being called is ncclKernel_AllReduce_RING_LL_Sum_float; I believe this kernel is constructed via the macros in common.h.

    The stitched-together kernel calls the runRing function from all_reduce.h, which in turn uses the primitives from prims_ll.h (correct me if I am mistaken).

    From my understanding, the communication across GPUs happens inside the kernel, so I would expect some synchronization among threads across GPUs using __threadfence_system, particularly in the LLGenericOp function. I see the waitSend and postRecv operations, which appear to synchronize (acting roughly as acquire and release operations, respectively), but they do not seem to use memory fences.

    By comparison, the SendRecv primitives do use the fence. Am I missing something here?
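    Background that may help frame this question: the LL protocol writes 4 bytes of data together with a 4-byte flag in a single volatile 8-byte store, so a receiver that observes the flag has necessarily observed the data stored with it, and no fence is needed between writing the data and writing the flag. A simplified sketch of the idea (not the actual prims_ll.h code):

    #include <stdint.h>

    __device__ void llSend(volatile uint64_t* slot, uint32_t data, uint32_t flag) {
      *slot = ((uint64_t)flag << 32) | data;   /* one atomic 8-byte store: data + flag */
    }

    __device__ uint32_t llRecv(volatile uint64_t* slot, uint32_t flag) {
      uint64_t v;
      do { v = *slot; } while ((uint32_t)(v >> 32) != flag);  /* spin until flag matches */
      return (uint32_t)v;   /* data is valid once the matching flag is observed */
    }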

    opened by nayakajay 1
  • Failure when run with 4 GPUs in each node

    Failure when run with 4 GPUs in each node

    Opening this issue here as well, as described in NVIDIA/nccl-tests#125. Surprisingly, it works with a single node and 4 GPUs, but not with 2 nodes.

    Copied and pasted from the other issue:

    Hello, I'm running some basic tests to check communication using two nodes. For some reason I don't quite understand, the test passes when there are 1 or 2 GPUs in each node, i.e.: mpirun -H ddp-0,ddp-1 -np 2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

    However, when I try 3 or 4 GPUs, it fails (NCCL_SHM_DISABLE=1 is set because we have very small SHM allowances, so NCCL should not attempt GPU-to-GPU communication through filesystem-backed shared memory):

    $ NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO mpirun -H ddp-0,ddp-1 -np 2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
    
    # nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid    263 on      ddp-0 device  0 [0x00] Tesla T4
    #  Rank  1 Group  0 Pid    263 on      ddp-0 device  1 [0x00] Tesla T4
    #  Rank  2 Group  0 Pid    263 on      ddp-0 device  2 [0x00] Tesla T4
    #  Rank  3 Group  0 Pid    263 on      ddp-0 device  3 [0x00] Tesla T4
    #  Rank  4 Group  0 Pid    311 on      ddp-1 device  0 [0x00] Tesla T4
    #  Rank  5 Group  0 Pid    311 on      ddp-1 device  1 [0x00] Tesla T4
    #  Rank  6 Group  0 Pid    311 on      ddp-1 device  2 [0x00] Tesla T4
    #  Rank  7 Group  0 Pid    311 on      ddp-1 device  3 [0x00] Tesla T4
    ddp-0:263:263 [0] NCCL INFO Bootstrap : Using eth0:10.106.68.39<0>
    ddp-0:263:263 [0] NCCL INFO NET/Plugin : Plugin load returned 17 : libnccl-net.so: cannot open shared object file: No such file or directory.
    ddp-0:263:263 [0] NCCL INFO NET/IB : No device found.
    ddp-0:263:263 [0] NCCL INFO NET/Socket : Using [0]eth0:10.106.68.39<0>
    ddp-0:263:263 [0] NCCL INFO Using network Socket
    NCCL version 2.10.3+cuda11.4
    ddp-0:263:274 [2] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
    ddp-0:263:275 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 2/4/-1->3->0
    ddp-0:263:274 [2] NCCL INFO Trees [0] 1/3/-1->2->5 [1] -1/-1/-1->2->3
    ddp-0:263:273 [1] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 0/-1/-1->1->-1
    ddp-0:263:272 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
    ddp-0:263:272 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
    ddp-0:263:272 [0] NCCL INFO Trees [0] 4/-1/-1->0->-1 [1] 3/-1/-1->0->1
    ddp-0:263:273 [1] NCCL INFO Channel 00 : 0[1b0] -> 1[1c0] [receive] via NET/Socket/0
    ddp-0:263:274 [2] NCCL INFO Channel 00 : 1[1c0] -> 2[1d0] [receive] via NET/Socket/0
    ddp-0:263:275 [3] NCCL INFO Channel 00 : 2[1d0] -> 3[1e0] [receive] via NET/Socket/0
    ddp-0:263:273 [1] NCCL INFO Channel 01 : 0[1b0] -> 1[1c0] [receive] via NET/Socket/0
    ddp-0:263:274 [2] NCCL INFO Channel 01 : 1[1c0] -> 2[1d0] [receive] via NET/Socket/0
    ddp-0:263:272 [0] NCCL INFO Channel 00 : 7[1e0] -> 0[1b0] [receive] via NET/Socket/0
    ddp-0:263:275 [3] NCCL INFO Channel 01 : 2[1d0] -> 3[1e0] [receive] via NET/Socket/0
    ddp-0:263:273 [1] NCCL INFO Channel 00 : 1[1c0] -> 2[1d0] [send] via NET/Socket/0
    ddp-0:263:274 [2] NCCL INFO Channel 00 : 2[1d0] -> 3[1e0] [send] via NET/Socket/0
    ddp-0:263:272 [0] NCCL INFO Channel 01 : 7[1e0] -> 0[1b0] [receive] via NET/Socket/0
    ddp-0:263:275 [3] NCCL INFO Channel 00 : 3[1e0] -> 4[1b0] [send] via NET/Socket/0
    ddp-0:263:273 [1] NCCL INFO Channel 01 : 1[1c0] -> 2[1d0] [send] via NET/Socket/0
    ddp-0:263:274 [2] NCCL INFO Channel 01 : 2[1d0] -> 3[1e0] [send] via NET/Socket/0
    ddp-0:263:272 [0] NCCL INFO Channel 00 : 0[1b0] -> 1[1c0] [send] via NET/Socket/0
    ddp-0:263:275 [3] NCCL INFO Channel 01 : 3[1e0] -> 4[1b0] [send] via NET/Socket/0
    ddp-0:263:275 [3] NCCL INFO Connected all rings
    ddp-0:263:272 [0] NCCL INFO Channel 01 : 0[1b0] -> 1[1c0] [send] via NET/Socket/0
    ddp-0:263:273 [1] NCCL INFO Connected all rings
    ddp-0:263:274 [2] NCCL INFO Connected all rings
    ddp-0:263:272 [0] NCCL INFO Connected all rings
    ddp-0:263:275 [3] NCCL INFO Channel 01 : 0[1b0] -> 3[1e0] [receive] via NET/Socket/0
    ddp-0:263:274 [2] NCCL INFO Channel 00 : 2[1d0] -> 5[1c0] [send] via NET/Socket/0
    ddp-0:263:273 [1] NCCL INFO Channel 00 : 2[1d0] -> 1[1c0] [receive] via NET/Socket/0
    ddp-0:263:274 [2] NCCL INFO Channel 00 : 5[1c0] -> 2[1d0] [receive] via NET/Socket/0
    ddp-0:263:272 [0] NCCL INFO Channel 01 : 0[1b0] -> 3[1e0] [send] via NET/Socket/0
    ddp-0:263:272 [0] NCCL INFO Channel 00 : 4[1b0] -> 0[1b0] [receive] via NET/Socket/0
    ddp-0:263:273 [1] NCCL INFO Channel 01 : 1[1c0] -> 0[1b0] [send] via NET/Socket/0
    ddp-0:263:275 [3] NCCL INFO Channel 01 : 3[1e0] -> 0[1b0] [send] via NET/Socket/0
    ddp-0:263:274 [2] NCCL INFO Channel 00 : 3[1e0] -> 2[1d0] [receive] via NET/Socket/0
    ddp-0:263:272 [0] NCCL INFO Channel 00 : 0[1b0] -> 4[1b0] [send] via NET/Socket/0
    ddp-0:263:272 [0] NCCL INFO Channel 01 : 3[1e0] -> 0[1b0] [receive] via NET/Socket/0
    ddp-0:263:274 [2] NCCL INFO Channel 01 : 3[1e0] -> 2[1d0] [receive] via NET/Socket/0
    ddp-0:263:275 [3] NCCL INFO Channel 01 : 4[1b0] -> 3[1e0] [receive] via NET/Socket/0
    ddp-0:263:275 [3] NCCL INFO Channel 00 : 3[1e0] -> 2[1d0] [send] via NET/Socket/0
    ddp-0:263:272 [0] NCCL INFO Channel 01 : 1[1c0] -> 0[1b0] [receive] via NET/Socket/0
    ddp-0:263:274 [2] NCCL INFO Channel 00 : 2[1d0] -> 1[1c0] [send] via NET/Socket/0
    ddp-0:263:272 [0] NCCL INFO Connected all trees
    ddp-0:263:272 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
    ddp-0:263:272 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
    ddp-0:263:275 [3] NCCL INFO Channel 01 : 3[1e0] -> 2[1d0] [send] via NET/Socket/0
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-0:263:275 [3] NCCL INFO Call to connect returned Connection refused, retrying
    ddp-1: Test NCCL failure common.cu:958 'unhandled system error'
     .. ddp-1 pid 311: Test failure common.cu:842
    --------------------------------------------------------------------------
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    
    ddp-0:263:275 [3] include/socket.h:409 NCCL WARN Net : Connect to 10.106.71.59<39875> failed : Connection refused
    ddp-0:263:275 [3] NCCL INFO bootstrap.cc:424 -> 2
    ddp-0:263:275 [3] NCCL INFO transport.cc:102 -> 2
    ddp-0:263:275 [3] NCCL INFO init.cc:789 -> 2
    ddp-0:263:275 [3] NCCL INFO init.cc:904 -> 2
    ddp-0:263:275 [3] NCCL INFO group.cc:72 -> 2 [Async thread]
    --------------------------------------------------------------------------
    mpirun detected that one or more processes exited with non-zero status, thus causing
    the job to be terminated. The first process to do so was:
    
      Process name: [[58500,1],1]
      Exit code:    3
    

    Can I get some help with this please?
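    The "Connection refused" retries above come from bootstrap.cc: besides MPI's own connections, NCCL opens additional TCP connections between ranks for bootstrap and (here) the socket transport, so a network policy that admits MPI traffic can still block NCCL. A minimal MPI + NCCL init repro (a sketch; error checking elided, and the one-rank-per-GPU mapping is an assumption, unlike the one-process-per-node layout above) isolates that step:

    /* init_repro.c -- build with e.g.: mpicc init_repro.c -lnccl -lcudart */
    #include <mpi.h>
    #include <nccl.h>
    #include <cuda_runtime.h>

    int main(int argc, char* argv[]) {
      int rank, nranks;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      ncclUniqueId id;
      if (rank == 0) ncclGetUniqueId(&id);            /* rank 0 creates the id */
      MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

      cudaSetDevice(rank % 4);                        /* assumes 4 GPUs per node */
      ncclComm_t comm;
      ncclCommInitRank(&comm, nranks, id, rank);      /* exercises NCCL's bootstrap */

      ncclCommDestroy(comm);
      MPI_Finalize();
      return 0;
    }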

    opened by dogacancolak 0
  • some more questions NCCL-2.14.3

    some more questions NCCL-2.14.3

    1. Stream class naming: why is it called strongstream, and why use it? Is there a weak stream?
    2. If the proxy uses the LL protocol to transfer data to a peer node via GDR, and the peer receives it without using a CQ-polling API, what confuses me is: is it possible that the NIC hardware still generates CQEs into the CQ, which could eventually overflow?
    3. In prims_simple.h, what does the template parameter "Unroll" mean? I noticed it used as an array length: "Pack128 vals[UNROLL]".
    4. In ReduceCopy128bMulti, what do preOp and postOp mean for the various reduction ops?
    5. A question about synchronization in the simple kernel, in the code below (see the sketch after this list):

       if (tid < nworkers && offset < nelem) {
         do {
           ...
           barrier(); // This barrier has a counterpart in following loop
           if (Send && (flags & RolePostSend) && index == 0) __threadfence_system();
           // A. in worker threads, it seems no one's role is postsend; what does this code do?
           __syncwarp();
           // B. I don't understand why warpsync is needed here
           postPeer<Recv, Send>();
           // C. in worker threads, it seems no one's role is postsend/recv; what does this code do?
           offset += sliceSize;
           slice += 1;
         } while (slice < SlicePerChunk && offset < nelem);
       }
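    As context for question 5, a hedged sketch (not the actual prims_simple.h code) of the pattern the snippet shows: worker threads finish writing a slice, then a single designated thread publishes it by fencing and advancing a counter that the peer polls, so the system-wide fence only has to execute in that one thread:

    #include <stdint.h>

    /* Sketch: all threads synchronize after writing the slice, then one
     * thread makes those writes globally visible and bumps the step counter
     * that the peer (another GPU or the proxy) spins on. */
    __device__ void publishSlice(uint64_t* peerStep, uint64_t step, bool isPostThread) {
      __syncthreads();               /* every worker finished writing the slice */
      if (isPostThread) {
        __threadfence_system();      /* order slice writes before the step update */
        *(volatile uint64_t*)peerStep = step;   /* peer polls this value */
      }
    }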
    opened by hongbilu 5
  • NCCL Ring algorithm is slower compared to purely MPI based collective algo (No Nccl)

    NCCL Ring algorithm is slower compared to purely MPI based collective algo (No Nccl)

    Hello,

    I am currently running a deep learning workload on 2 nodes connected with RoCE. Running the same application in different environments, I get the following relative performance,

    from fastest to slowest

    NCCL (Tree) > MPI > NCCL (Ring).

    I have rerun it multiple times but get the same result. I am using the Horovod Docker image; to test the MPI result, I skip the NCCL installation step in the Dockerfile and rebuild the image.

    Command:

    horovodrun --verbose -np=16 -H=${HOST} --fusion-threshold-mb 256 --cycle-time-ms 2.5 --mpi-args="--bind-to none --map-by numa \
     --mca btl=openib,self,vader btl_openib_cpc_include=rdmacm \
    -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 \
    -x NCCL_DEBUG=INFO -x NCCL_VERSION \
    -x NV_LIBNCCL_DEV_PACKAGE -x NV_LIBNCCL_DEV_PACKAGE_NAME -x NV_LIBNCCL_DEV_PACKAGE_VERSION -x NV_LIBNCCL_PACKAGE -x NV_LIBNCCL_PACKAGE_NAME -x NV_LIBNCCL_PACKAGE_VERSION \
    -x NVARCH -x NVIDIA_DRIVER_CAPABILITIES -x NVIDIA_REQUIRE_CUDA -x NVIDIA_VISIBLE_DEVICES -x NV_CUDA_COMPAT_PACKAGE \
    -x NV_CUDA_CUDART_DEV_VERSION -x NV_CUDA_CUDART_VERSION -x NV_CUDA_LIB_VERSION -x NV_LIBCUBLAS_DEV_PACKAGE -x NV_LIBCUBLAS_DEV_PACKAGE_NAME \
    -x NV_LIBCUBLAS_DEV_VERSION -x NV_LIBCUBLAS_PACKAGE -x NV_LIBCUBLAS_PACKAGE_NAME -x NV_LIBCUBLAS_VERSION -x NV_LIBCUSPARSE_DEV_VERSION -x NV_LIBCUSPARSE_VERSION \
    -x NV_LIBNPP_DEV_PACKAGE -x NV_LIBNPP_DEV_VERSION -x NV_LIBNPP_PACKAGE -x NV_LIBNPP_VERSION -x NV_NVML_DEV_VERSION -x NV_NVTX_VERSION \
    -x NCCL_ALGO=Ring -x NCCL_SOCKET_IFNAME=ens8f0 -x NCCL_SOCKET_NTHREADS=4 -x NCCL_NSOCKS_PERTHREAD=16"
    

    Are these rankings expected? Let me know if any more information is needed.

    Thank you

    opened by PurvangL 0
  • Make NCCL build with `-std=c++17`

    Make NCCL build with `-std=c++17`

    Currently NCCL sets -std=c++11 in its Makefile. Current releases do not build under -std=c++17:

    e.g., after modifying the Makefile to change the C++ standard:

    In file included from include/core.h:62,
                     from include/info.h:13,
                     from include/graph.h:112,
                     from include/transport.h:11,
                     from include/comm.h:10,
                     from include/enqueue.h:10,
                     from enqueue.cc:7:
    include/nvtx.h: In constructor ‘constexpr payload_schema::payload_schema(const nvtxPayloadSchemaEntry_t*, size_t, uint64_t, const char*)’:
    include/nvtx.h:46:30: error: call to non-‘constexpr’ function ‘uint64_t nvtxPayloadSchemaRegister(nvtxDomainHandle_t, const nvtxPayloadSchemaAttr_t*)’
       46 |     nvtxPayloadSchemaRegister(nvtx3::domain::get<nccl_domain>(), &schema_attr);
          |     ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    In file included from include/nvtx3/nvToolsExtPayload.h:764,
                     from include/nvtx3.hpp:95,
                     from include/nvtx.h:10:
    include/nvtx3/nvtxExtDetail/nvtxExtImplPayload_v1.h:77:28: note: ‘uint64_t nvtxPayloadSchemaRegister(nvtxDomainHandle_t, const nvtxPayloadSchemaAttr_t*)’ declared here
       77 | NVTX_EXT_FN_IMPL(uint64_t, nvtxPayloadSchemaRegister, (nvtxDomainHandle_t domain, const nvtxPayloadSchemaAttr_t* attr), (domain, attr))
          |                            ^~~~~~~~~~~~~~~~~~~~~~~~~
    include/nvtx3/nvtxExtDetail/nvtxExtImplPayload_v1.h:61:39: note: in definition of macro ‘NVTX_EXT_FN_IMPL’
       61 | NVTX_LINKONCE_DEFINE_FUNCTION ret_val fn_name signature { \
    

    While this is a non-standard way to build NCCL, we noticed this in the XLA build, which self-builds NCCL using Bazel. It looks to me like a real bug.
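    A reduced illustration of the class of error in the log (hypothetical code, not the NCCL/NVTX source): a constexpr constructor whose body unconditionally calls a non-constexpr function can never be constant-evaluated, and GCC rejects it with the same diagnostic seen above.

    #include <cstdint>

    uint64_t registerSchema(const char* name);   // stand-in for nvtxPayloadSchemaRegister

    struct PayloadSchema {
      // GCC: "error: call to non-'constexpr' function
      //       'uint64_t registerSchema(const char*)'"
      constexpr PayloadSchema(const char* name) { registerSchema(name); }
    };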

    opened by hawkinsp 1