NCCL

Optimized primitives for inter-GPU communication.

Introduction

NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, and reduce-scatter, as well as any send/receive-based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, and NVSwitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

For more information on NCCL usage, please refer to the NCCL documentation.
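For readers new to these collectives, the following pure-Python sketch models what each one computes. This is only the data-movement semantics, not the NCCL API; the list-of-lists buffer representation is illustrative.

```python
# Pure-Python reference semantics for NCCL's collectives.
# Each "rank" is modeled as a list of numbers; the functions return
# the per-rank result buffers. This models data movement only -- it
# is not the NCCL API.

def all_reduce(ranks):
    """Every rank ends up with the elementwise sum across all ranks."""
    total = [sum(vals) for vals in zip(*ranks)]
    return [total[:] for _ in ranks]

def all_gather(ranks):
    """Every rank ends up with the concatenation of all ranks' buffers."""
    gathered = [x for r in ranks for x in r]
    return [gathered[:] for _ in ranks]

def reduce_scatter(ranks):
    """The elementwise sum is split evenly; rank i keeps chunk i."""
    total = [sum(vals) for vals in zip(*ranks)]
    chunk = len(total) // len(ranks)
    return [total[i * chunk:(i + 1) * chunk] for i in range(len(ranks))]

def broadcast(ranks, root=0):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(ranks[root]) for _ in ranks]

# all_reduce([[1, 2], [3, 4]]) -> [[4, 6], [4, 6]]
```

Note that all-reduce is equivalent to a reduce-scatter followed by an all-gather, which is how ring-based implementations typically realize it.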

Build

Note: the official and tested builds of NCCL can be downloaded from: https://developer.nvidia.com/nccl. You can skip the following build steps if you choose to use the official builds.

To build the library:

$ cd nccl
$ make -j src.build

If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with:

$ make src.build CUDA_HOME=<path to cuda install>

NCCL will be compiled and installed in build/ unless BUILDDIR is set.

By default, NCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining NVCC_GENCODE (defined in makefiles/common.mk) to only include the architecture of the target platform:

$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"

Install

To install NCCL on the system, create a package, then install it as root.

Debian/Ubuntu:

$ # Install tools to create debian packages
$ sudo apt install build-essential devscripts debhelper fakeroot
$ # Build NCCL deb package
$ make pkg.debian.build
$ ls build/pkg/deb/

RedHat/CentOS:

$ # Install tools to create rpm packages
$ sudo yum install rpm-build rpmdevtools
$ # Build NCCL rpm package
$ make pkg.redhat.build
$ ls build/pkg/rpm/

OS-agnostic tarball:

$ make pkg.txz.build
$ ls build/pkg/txz/

Tests

Tests for NCCL are maintained separately at https://github.com/nvidia/nccl-tests.

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make
$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus>
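In this invocation, -b and -e set the minimum and maximum message sizes and -f is the multiplication factor between steps (flag semantics as documented in the nccl-tests README). A small sketch of the size sweep this produces:

```python
def size_sweep(begin, end, factor):
    """Message sizes exercised by nccl-tests: from begin to end (bytes),
    multiplying by factor at each step."""
    sizes = []
    size = begin
    while size <= end:
        sizes.append(size)
        size *= factor
    return sizes

# -b 8 -e 256M -f 2: powers of two from 8 B up to 256 MiB.
sweep = size_sweep(8, 256 * 1024 * 1024, 2)
print(sweep[:4], sweep[-1])  # -> [8, 16, 32, 64] 268435456
```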

Copyright

All source code and accompanying documentation is copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.

Issues
  • NCCL 2.6.4 makes the system hang.

    Linux: Ubuntu 20.04 LTS. GPU driver: latest NVIDIA Linux driver. CUDA 10.1, cuDNN 7.6.5, NCCL 2.6.4. Hardware: CPU: Intel 9400F; motherboard: Z370; RAM: 64 GB dual-channel; GPUs: two RTX 2080 Ti on two PCIe 3.0 x8 slots, with an NVLink bridge between them.

    I ran all the nccl-tests and NCCL appears to be working. But while each test runs (about 30 minutes per test), the system freezes: I can't switch to the browser or do anything else. I can still move the mouse, but the system doesn't respond to mouse clicks or keyboard input. When the test finishes running, the system returns to normal and the log prints to the console.

    The log is here:

    #  ./all_reduce_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   3795 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   3795 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float     sum     7.18    0.00    0.00  0e+00     7.02    0.00    0.00  0e+00
              16             4   float     sum     7.00    0.00    0.00  0e+00     7.02    0.00    0.00  0e+00
              32             8   float     sum     7.28    0.00    0.00  0e+00     7.19    0.00    0.00  0e+00
              64            16   float     sum     7.20    0.01    0.01  0e+00     7.05    0.01    0.01  0e+00
             128            32   float     sum     7.30    0.02    0.02  0e+00     7.19    0.02    0.02  0e+00
             256            64   float     sum     7.30    0.04    0.04  0e+00     7.20    0.04    0.04  0e+00
             512           128   float     sum     7.47    0.07    0.07  0e+00     7.12    0.07    0.07  0e+00
            1024           256   float     sum     8.14    0.13    0.13  0e+00     7.92    0.13    0.13  0e+00
            2048           512   float     sum     8.56    0.24    0.24  0e+00     8.43    0.24    0.24  0e+00
            4096          1024   float     sum     9.72    0.42    0.42  0e+00     9.49    0.43    0.43  0e+00
            8192          2048   float     sum    11.99    0.68    0.68  0e+00    11.92    0.69    0.69  0e+00
           16384          4096   float     sum    14.36    1.14    1.14  0e+00    14.21    1.15    1.15  0e+00
           32768          8192   float     sum    16.79    1.95    1.95  0e+00    16.64    1.97    1.97  0e+00
           65536         16384   float     sum    21.14    3.10    3.10  0e+00    20.55    3.19    3.19  0e+00
          131072         32768   float     sum    35.56    3.69    3.69  0e+00    35.43    3.70    3.70  0e+00
          262144         65536   float     sum    41.23    6.36    6.36  0e+00    41.21    6.36    6.36  0e+00
          524288        131072   float     sum    50.66   10.35   10.35  0e+00    50.82   10.32   10.32  0e+00
         1048576        262144   float     sum    72.54   14.45   14.45  0e+00    72.45   14.47   14.47  0e+00
         2097152        524288   float     sum    120.7   17.37   17.37  0e+00    118.4   17.71   17.71  0e+00
         4194304       1048576   float     sum    215.2   19.49   19.49  0e+00    214.7   19.53   19.53  0e+00
         8388608       2097152   float     sum    411.3   20.39   20.39  0e+00    399.1   21.02   21.02  0e+00
        16777216       4194304   float     sum    865.3   19.39   19.39  0e+00    779.6   21.52   21.52  0e+00
        33554432       8388608   float     sum   1547.9   21.68   21.68  0e+00   1699.3   19.75   19.75  0e+00
        67108864      16777216   float     sum   3115.1   21.54   21.54  0e+00   3007.4   22.31   22.31  0e+00
       134217728      33554432   float     sum   5994.3   22.39   22.39  0e+00   5991.9   22.40   22.40  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 7.43886 
    
    ./all_gather_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   9119 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   9119 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                             out-of-place                       in-place          
    #       size         count    type     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             1   float     7.14    0.00    0.00  0e+00     7.06    0.00    0.00  0e+00
              16             2   float     7.03    0.00    0.00  0e+00     7.00    0.00    0.00  0e+00
              32             4   float     6.96    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
              64             8   float     7.10    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
             128            16   float     7.10    0.01    0.01  0e+00     7.14    0.01    0.01  0e+00
             256            32   float     7.18    0.02    0.02  0e+00     7.23    0.02    0.02  0e+00
             512            64   float     7.49    0.03    0.03  0e+00     7.47    0.03    0.03  0e+00
            1024           128   float     7.03    0.07    0.07  0e+00     6.96    0.07    0.07  0e+00
            2048           256   float     6.97    0.15    0.15  0e+00     6.97    0.15    0.15  0e+00
            4096           512   float     7.41    0.28    0.28  0e+00     7.00    0.29    0.29  0e+00
            8192          1024   float     9.59    0.43    0.43  0e+00     8.80    0.47    0.47  0e+00
           16384          2048   float    11.41    0.72    0.72  0e+00    10.78    0.76    0.76  0e+00
           32768          4096   float    13.39    1.22    1.22  0e+00    11.85    1.38    1.38  0e+00
           65536          8192   float    16.57    1.98    1.98  0e+00    13.83    2.37    2.37  0e+00
          131072         16384   float    23.07    2.84    2.84  0e+00    18.39    3.56    3.56  0e+00
          262144         32768   float    31.38    4.18    4.18  0e+00    30.27    4.33    4.33  0e+00
          524288         65536   float    36.00    7.28    7.28  0e+00    35.30    7.43    7.43  0e+00
         1048576        131072   float    47.38   11.06   11.06  0e+00    46.84   11.19   11.19  0e+00
         2097152        262144   float    70.44   14.89   14.89  0e+00    69.77   15.03   15.03  0e+00
         4194304        524288   float    120.1   17.46   17.46  0e+00    115.5   18.16   18.16  0e+00
         8388608       1048576   float    212.5   19.73   19.73  0e+00    210.2   19.95   19.95  0e+00
        16777216       2097152   float    418.5   20.05   20.05  0e+00    414.0   20.26   20.26  0e+00
        33554432       4194304   float    817.8   20.51   20.51  0e+00    785.1   21.37   21.37  0e+00
        67108864       8388608   float   1568.3   21.40   21.40  0e+00   1560.9   21.50   21.50  0e+00
       134217728      16777216   float   3298.6   20.34   20.34  0e+00   3070.3   21.86   21.86  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 6.6972 
    
    ./broadcast_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid  26256 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid  26256 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type    root     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float       0     7.24    0.00    0.00  0e+00     7.50    0.00    0.00  0e+00
              16             4   float       0     8.31    0.00    0.00  0e+00     7.69    0.00    0.00  0e+00
              32             8   float       0     8.15    0.00    0.00  0e+00     8.23    0.00    0.00  0e+00
              64            16   float       0     7.19    0.01    0.01  0e+00     7.13    0.01    0.01  0e+00
             128            32   float       0     7.25    0.02    0.02  0e+00     7.45    0.02    0.02  0e+00
             256            64   float       0     7.08    0.04    0.04  0e+00     7.16    0.04    0.04  0e+00
             512           128   float       0     7.47    0.07    0.07  0e+00     7.39    0.07    0.07  0e+00
            1024           256   float       0     7.19    0.14    0.14  0e+00    32.19    0.03    0.03  0e+00
            2048           512   float       0     7.36    0.28    0.28  0e+00     7.03    0.29    0.29  0e+00
            4096          1024   float       0     7.25    0.57    0.57  0e+00     7.07    0.58    0.58  0e+00
            8192          2048   float       0     9.11    0.90    0.90  0e+00     8.10    1.01    1.01  0e+00
           16384          4096   float       0    10.97    1.49    1.49  0e+00    10.52    1.56    1.56  0e+00
           32768          8192   float       0    13.36    2.45    2.45  0e+00    11.73    2.79    2.79  0e+00
           65536         16384   float       0    17.03    3.85    3.85  0e+00    14.24    4.60    4.60  0e+00
          131072         32768   float       0    22.66    5.78    5.78  0e+00    22.60    5.80    5.80  0e+00
          262144         65536   float       0    28.48    9.21    9.21  0e+00    28.45    9.21    9.21  0e+00
          524288        131072   float       0    40.26   13.02   13.02  0e+00    40.08   13.08   13.08  0e+00
         1048576        262144   float       0    63.48   16.52   16.52  0e+00    63.19   16.59   16.59  0e+00
         2097152        524288   float       0    110.1   19.04   19.04  0e+00    109.3   19.19   19.19  0e+00
         4194304       1048576   float       0    205.7   20.39   20.39  0e+00    237.1   17.69   17.69  0e+00
         8388608       2097152   float       0    425.1   19.73   19.73  0e+00    386.7   21.69   21.69  0e+00
        16777216       4194304   float       0    815.0   20.59   20.59  0e+00    824.0   20.36   20.36  0e+00
        33554432       8388608   float       0   1536.8   21.83   21.83  0e+00   1508.2   22.25   22.25  0e+00
        67108864      16777216   float       0   3139.2   21.38   21.38  0e+00   3124.3   21.48   21.48  0e+00
       134217728      33554432   float       0   6283.5   21.36   21.36  0e+00   5873.1   22.85   22.85  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 7.99748 
    
    $ ./reduce_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   4810 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   4810 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop    root     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             2   float     sum       0     7.16    0.00    0.00  0e+00     7.35    0.00    0.00  0e+00
              16             4   float     sum       0     7.74    0.00    0.00  0e+00     7.67    0.00    0.00  0e+00
              32             8   float     sum       0     7.08    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
              64            16   float     sum       0     7.13    0.01    0.01  0e+00     7.14    0.01    0.01  0e+00
             128            32   float     sum       0     7.15    0.02    0.02  0e+00     7.06    0.02    0.02  0e+00
             256            64   float     sum       0     7.14    0.04    0.04  0e+00     7.12    0.04    0.04  0e+00
             512           128   float     sum       0     7.14    0.07    0.07  0e+00     7.11    0.07    0.07  0e+00
            1024           256   float     sum       0     7.09    0.14    0.14  0e+00     7.09    0.14    0.14  0e+00
            2048           512   float     sum       0     7.11    0.29    0.29  0e+00     7.12    0.29    0.29  0e+00
            4096          1024   float     sum       0     7.28    0.56    0.56  0e+00     7.20    0.57    0.57  0e+00
            8192          2048   float     sum       0     8.72    0.94    0.94  0e+00     8.59    0.95    0.95  0e+00
           16384          4096   float     sum       0    10.80    1.52    1.52  0e+00    10.78    1.52    1.52  0e+00
           32768          8192   float     sum       0    12.89    2.54    2.54  0e+00    12.64    2.59    2.59  0e+00
           65536         16384   float     sum       0    16.42    3.99    3.99  0e+00    15.88    4.13    4.13  0e+00
          131072         32768   float     sum       0    23.17    5.66    5.66  0e+00    23.27    5.63    5.63  0e+00
          262144         65536   float     sum       0    29.13    9.00    9.00  0e+00    28.88    9.08    9.08  0e+00
          524288        131072   float     sum       0    40.93   12.81   12.81  0e+00    40.93   12.81   12.81  0e+00
         1048576        262144   float     sum       0    64.30   16.31   16.31  0e+00    64.25   16.32   16.32  0e+00
         2097152        524288   float     sum       0    110.5   18.98   18.98  0e+00    110.6   18.97   18.97  0e+00
         4194304       1048576   float     sum       0    202.1   20.76   20.76  0e+00    202.1   20.76   20.76  0e+00
         8388608       2097152   float     sum       0    386.5   21.70   21.70  0e+00    386.3   21.71   21.71  0e+00
        16777216       4194304   float     sum       0    752.6   22.29   22.29  0e+00    752.5   22.30   22.30  0e+00
        33554432       8388608   float     sum       0   1485.2   22.59   22.59  0e+00   1529.3   21.94   21.94  0e+00
        67108864      16777216   float     sum       0   2947.4   22.77   22.77  0e+00   2945.2   22.79   22.79  0e+00
       134217728      33554432   float     sum       0   5873.8   22.85   22.85  0e+00   5873.8   22.85   22.85  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 8.22671 
    $ ./reduce_scatter_perf -b 8 -e 128M -f 2 -g 2
    # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
    #
    # Using devices
    #   Rank  0 Pid   5435 on w-system device  0 [0x01] GeForce RTX 2080 Ti
    #   Rank  1 Pid   5435 on w-system device  1 [0x02] GeForce RTX 2080 Ti
    #
    #                                                     out-of-place                       in-place          
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
               8             1   float     sum     7.21    0.00    0.00  0e+00     7.28    0.00    0.00  0e+00
              16             2   float     sum     7.12    0.00    0.00  0e+00     7.18    0.00    0.00  0e+00
              32             4   float     sum     7.14    0.00    0.00  0e+00     7.22    0.00    0.00  0e+00
              64             8   float     sum     7.20    0.00    0.00  0e+00     7.15    0.00    0.00  0e+00
             128            16   float     sum     7.14    0.01    0.01  0e+00     7.12    0.01    0.01  0e+00
             256            32   float     sum     7.16    0.02    0.02  0e+00     7.12    0.02    0.02  0e+00
             512            64   float     sum     7.18    0.04    0.04  0e+00     7.12    0.04    0.04  0e+00
            1024           128   float     sum     7.53    0.07    0.07  0e+00     7.27    0.07    0.07  0e+00
            2048           256   float     sum     7.28    0.14    0.14  0e+00     7.23    0.14    0.14  0e+00
            4096           512   float     sum     7.64    0.27    0.27  0e+00     7.57    0.27    0.27  0e+00
            8192          1024   float     sum     9.35    0.44    0.44  0e+00     9.24    0.44    0.44  0e+00
           16384          2048   float     sum    11.33    0.72    0.72  0e+00    11.23    0.73    0.73  0e+00
           32768          4096   float     sum    12.66    1.29    1.29  0e+00    12.62    1.30    1.30  0e+00
           65536          8192   float     sum    15.39    2.13    2.13  0e+00    15.31    2.14    2.14  0e+00
          131072         16384   float     sum    21.02    3.12    3.12  0e+00    21.35    3.07    3.07  0e+00
          262144         32768   float     sum    32.36    4.05    4.05  0e+00    31.98    4.10    4.10  0e+00
          524288         65536   float     sum    39.63    6.61    6.61  0e+00    39.76    6.59    6.59  0e+00
         1048576        131072   float     sum    57.11    9.18    9.18  0e+00    56.88    9.22    9.22  0e+00
         2097152        262144   float     sum    92.96   11.28   11.28  0e+00    92.54   11.33   11.33  0e+00
         4194304        524288   float     sum    166.4   12.60   12.60  0e+00    165.9   12.64   12.64  0e+00
         8388608       1048576   float     sum    308.5   13.59   13.59  0e+00    504.4    8.32    8.32  0e+00
        16777216       2097152   float     sum   1050.1    7.99    7.99  0e+00    693.5   12.10   12.10  0e+00
        33554432       4194304   float     sum   1533.4   10.94   10.94  0e+00   1414.8   11.86   11.86  0e+00
        67108864       8388608   float     sum   2529.2   13.27   13.27  0e+00   2314.2   14.50   14.50  0e+00
       134217728      16777216   float     sum   5619.2   11.94   11.94  0e+00   4905.4   13.68   13.68  0e+00
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 4.44552 
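As a sanity check on the tables above (a sketch, assuming the bandwidth formulas documented in the nccl-tests README): algbw is bytes moved divided by elapsed time, and for all-reduce busbw = algbw × 2(n−1)/n, so with n = 2 GPUs the two columns coincide. Checking against the largest all_reduce_perf row above (134217728 B in 5994.3 us):

```python
def allreduce_busbw(size_bytes, time_us, ngpus):
    """All-reduce bus bandwidth in GB/s, per the nccl-tests performance
    notes: algbw = size / time, busbw = algbw * 2 * (n - 1) / n."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    return algbw * 2 * (ngpus - 1) / ngpus

# Largest out-of-place all_reduce_perf row above, on 2 GPUs.
print(round(allreduce_busbw(134217728, 5994.3, 2), 2))  # -> 22.39
```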
    

    Originally I found this issue while training with TensorFlow. I first submitted a bug to TensorFlow; here is the link: https://github.com/tensorflow/tensorflow/issues/40027

    It shows that when I remove the NVLink bridge, the TF code runs well, and when I use the NVLink bridge but not NCCL, the TF code also runs well. But when I use both NCCL and the NVLink bridge, the system halts and I have to reboot.

    opened by AlexWang1900 37
  • NCCL InfiniBand performance

    Hi NCCL devs! I have two machines in a cluster communicating over InfiniBand. There is 400 Gb/s of bandwidth available between the machines (confirmed with ib_send_bw), but:

    1. nccl-tests only achieves about 20 GB/s, roughly half of what I would expect
    2. there is a decent amount of variance

    running broadcast_perf on 2 machines:

    NCCL_DEBUG=INFO mpiexec -f <hosts file> /root/code/nccl-tests/build/broadcast_perf -b 1M -e 2048M -f 2 -g 1 -c 0 -d half
    

    nccl.txt

    This log shows that (1) NCCL is getting between about 15 and 20 GB/s in busbw, and (2) the speed isn't monotonic for larger message sizes and can change significantly across runs.

    Any ideas on what could be going wrong here? I would expect that I should be getting something closer to 45 GB/s and that there would be more consistency across runs.
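As a unit sanity check (a sketch; only the 400 Gb/s and ~20 GB/s figures come from the report above): ib_send_bw quotes link speed in bits per second, while nccl-tests reports bytes per second, so full wire speed here is about 50 GB/s, and ~45 GB/s is a plausible expectation after protocol overheads:

```python
# Link speeds are quoted in Gb/s (bits); nccl-tests busbw is in GB/s (bytes).
def gbps_to_gigabytes_per_s(gbps):
    """Convert gigabits/s to gigabytes/s (decimal units)."""
    return gbps / 8.0

wire = gbps_to_gigabytes_per_s(4 * 100)  # four 100 Gb/s EDR HCAs
print(wire)                              # -> 50.0
print(round(20 / wire * 100))            # observed busbw as % of wire -> 40
```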

    env vars:

    NCCL_IB_HCA=^mlx5_2
    NCCL_SOCKET_IFNAME=eth
    

    ibstatus

    Infiniband device 'mlx5_0' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d4:fe72
            base lid:        0x2ed
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_1' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d4:fe74
            base lid:        0x5b3
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_2' port 1 status:
            default gid:     0000:0000:0000:0000:0000:0000:0000:0000
            base lid:        0x0
            sm lid:          0x0
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      Ethernet
    
    Infiniband device 'mlx5_3' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d5:04c6
            base lid:        0x2f3
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    Infiniband device 'mlx5_4' port 1 status:
            default gid:     fe80:0000:0000:0000:b859:9f03:00d5:04c8
            base lid:        0x679
            sm lid:          0x10d
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            100 Gb/sec (4X EDR)
            link_layer:      InfiniBand
    
    opened by christopherhesse 34
  • AllReduce hangs

    My problem was diagnosed in https://github.com/tensorflow/tensorflow/issues/32654 - please find all the info about my environment there.

    I'm using the master version of NCCL. I launch all_reduce_perf and it hangs, with 100% volatile GPU utilization reported.

    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4
    # nThread 1 nGpus 4 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid  15833 on jupyter-vmarkovtsev device  0 [0x02] GeForce GTX 1080 Ti
    #   Rank  1 Pid  15833 on jupyter-vmarkovtsev device  1 [0x03] GeForce GTX 1080 Ti
    #   Rank  2 Pid  15833 on jupyter-vmarkovtsev device  2 [0x82] GeForce GTX 1080 Ti
    #   Rank  3 Pid  15833 on jupyter-vmarkovtsev device  3 [0x83] GeForce GTX 1080 Ti
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Bootstrap : Using [0]eth0:10.2.3.32<0>
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    
    jupyter-vmarkovtsev:15833:15833 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Socket : Using [0]eth0:10.2.3.32<0>
    NCCL version 2.4.8+cuda10.0
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO nranks 4
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
    jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Channel 00 :    0   1   2   3
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
    jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
    jupyter-vmarkovtsev:15833:15833 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
    jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via direct shared memory
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Launch mode Group/CGMD
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
    jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
    
    jupyter-vmarkovtsev:15833:15833 [0] init.cc:1250 NCCL WARN Mismatched collective detected, please check your collectivecalls at and around rank 3. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs
    

    I waited for 10 minutes; no more logs were printed.

    opened by vmarkovtsev 31
  • NCCL INFO NET/Plugin : No plugin found (libnccl-net.so)

    We got stuck using the Clara SDK Docker image on Kubeflow with multi-GPU training (/commands/train_2gpu.sh). It just hangs. Not sure if it's the plugin-not-found issue or our hardware configuration. We are using a DGX-1 with Kubernetes/Kubeflow. Please help.

    Requested train epochs: 10; iterations: 158

    2020-06-29 22:20:20.310128: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

    Requested train epochs: 10; iterations: 158

    2020-06-29 22:20:24.223690: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

    2020-06-29 22:20:24.816974: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

    ds-ml-01-0:17085:17354 [0] NCCL INFO Bootstrap : Using [0]eth0:10.233.66.147<0>

    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/IB : No device found.

    ds-ml-01-0:17085:17354 [0] NCCL INFO NET/Socket : Using [0]eth0:10.233.66.147<0>

    NCCL version 2.4.8+cuda10.1

    ds-ml-01-0:17086:17353 [1] NCCL INFO Bootstrap : Using [0]eth0:10.233.66.147<0>

    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/IB : No device found.

    ds-ml-01-0:17086:17353 [1] NCCL INFO NET/Socket : Using [0]eth0:10.233.66.147<0>

    ds-ml-01-0:17085:17354 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff

    ds-ml-01-0:17086:17353 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff

    opened by lalithvaka 28
  • Suboptimal performance with TCP over high bandwidth networks

    Suboptimal performance with TCP over high bandwidth networks

    Hi! Many thanks for creating a great framework. NCCL is widely used at our org for scaling the training of ML models and has proved very reliable.

    I am currently trying to figure out how to achieve optimal inter-node performance with NCCL running over TCP on high bandwidth networks (32 Gbps, 100 Gbps, and higher). Even with large message sizes we have not been able to reliably obtain more than 60% of wire speed over 32 Gbps networks (see below for nccl-tests output). From what I've gathered, NCCL just hasn't been fully optimized for this configuration yet (although I'm still holding out some hope that I'm just doing it wrong 😄).

    I'm prepared to work full-time for several weeks on lifting any limitations in the current implementation, but I could use a few pointers to get started. Do you have a sense of what the most promising changes might be and how to incorporate them into the codebase? One thing I might want to explore is using multiple threads/TCP streams. But there is still scope to better utilize a single TCP stream as well, so maybe there are some simpler optimizations to try first?

    I've been looking into the codebase and there's a number of things that I don't really understand yet:

    • Running nccl-tests all_reduce_perf -w 0 -n 1 seems to spawn a total of 4 allreduce ops according to my TRACE output. I would have expected just 2 (one for in-place, one for out-of-place).
    • I'm not super clear on the control flow/threading model. In my tests NCCL uses exactly two cores; some of the main files of interest seem to be net_socket.cc, net.cc, socket.h, and enqueue.cc, and a lot of cycles are spent polling ncclSocketIrecv/ncclSocketIsend, but I'm still struggling with how everything fits together and exactly where/how the actual network transfers happen.

    Some more details on my setup: my current config consists of two GCE machines with 8xV100, plenty of cores/RAM, and a 32 Gbps network (no RDMA). I get about 28 Gbps bidirectional bandwidth by running one iperf3 server and client on each node (and >30 Gbps with the -Z -P4 flags). Anecdotally, more complex setups that include Horovod have occasionally been able to hit 60% of wire speed on 32 Gbps and 50 Gbps networks. In this case, running nccl-tests only yields 16 Gbps:

    [email protected]:/# mpirun --allow-run-as-root -H 10.73.0.52:1,10.73.0.15:1 -np 2 -mca btl_tcp_if_include ens12 -x NCCL_IB_DISABLE=1 -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=ens12 -x NCCL_DEBUG=INFO /nccl-tests/build/all_reduce_perf -b 1G -e 1G -f 2 -g 1 -c 0
    # nThread 1 nGpus 1 minBytes 1073741824 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
    #
    # Using devices
    #   Rank  0 Pid     51 on managed-worker-l83z device  0 [0x00] Tesla V100-SXM2-16GB
    #   Rank  1 Pid     73 on managed-worker-jbk7 device  0 [0x00] Tesla V100-SXM2-16GB
    managed-worker-l83z:51:51 [0] NCCL INFO NET/Socket : Using [0]ens12:10.73.0.52<0>
    managed-worker-l83z:51:51 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    managed-worker-l83z:51:51 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
    NCCL version 2.4.2+cuda10.0
    managed-worker-jbk7:73:73 [0] NCCL INFO NET/Socket : Using [0]ens12:10.73.0.15<0>
    managed-worker-jbk7:73:73 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
    managed-worker-jbk7:73:73 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
    managed-worker-l83z:51:57 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
    managed-worker-l83z:51:57 [0] NCCL INFO comm 0x7fd518002560 rank 0 nranks 2 cudaDev 0 nvmlDev 0
    managed-worker-jbk7:73:78 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
    managed-worker-jbk7:73:78 [0] NCCL INFO comm 0x7f9be0002560 rank 1 nranks 2 cudaDev 0 nvmlDev 0
    managed-worker-l83z:51:57 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
    managed-worker-jbk7:73:78 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
    managed-worker-l83z:51:57 [0] NCCL INFO Channel 00 :    0   1
    managed-worker-l83z:51:57 [0] NCCL INFO Channel 01 :    0   1
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 01 : 1 -> 0 [receive] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 01 : 0 -> 1 [receive] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Ring 01 : 0 -> 1 [send] via NET/Socket/0
    managed-worker-jbk7:73:78 [0] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/0
    managed-worker-l83z:51:57 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
    managed-worker-l83z:51:57 [0] NCCL INFO comm 0x7fd518002560 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
    managed-worker-jbk7:73:78 [0] NCCL INFO comm 0x7f9be0002560 rank 1 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    managed-worker-l83z:51:51 [0] NCCL INFO Launch mode Parallel
      1073741824     268435456   float     sum   539383    1.99    1.99    N/A   553087    1.94    1.94    N/A
    managed-worker-l83z:51:51 [0] NCCL INFO Destroyed comm 0x7fd518002560 rank 0
    managed-worker-jbk7:73:73 [0] NCCL INFO Destroyed comm 0x7f9be0002560 rank 1
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 1.96602
    #
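To sanity-check the table above (my own arithmetic, using the bus-bandwidth convention from the nccl-tests documentation): algbw = size/time, busbw = algbw * 2(n-1)/n for allreduce (so busbw equals algbw at n = 2), and converting to line rate confirms we are at roughly half of the 32 Gbps wire speed:

```python
# Recompute the 1 GiB out-of-place row from the table above.
size_bytes = 1073741824        # message size (B)
time_us = 539383               # measured out-of-place time (us)
nranks = 2

algbw = size_bytes / time_us / 1000        # bytes/us -> GB/s (decimal GB)
busbw = algbw * 2 * (nranks - 1) / nranks  # allreduce bus-bandwidth factor
print(round(algbw, 2), round(busbw, 2))    # → 1.99 1.99

# Express the algorithm bandwidth as a share of the 32 Gbps link.
gbps = algbw * 8                # ~15.9 Gbps on the wire
print(0.45 < gbps / 32 < 0.55)  # roughly half of wire speed → True
```

This matches the "16 Gbps" figure quoted above, i.e. about 50% of the nominal link rate.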
    
    opened by cswinter 27
  • NCCL didn't print the right log about the connection when enabling GDR

    NCCL didn't print the right log about the connection when enabling GDR

    Environment

    • NCCL version 2.5.7+cuda10.0
    • 8 * V100-PCIe per node, a total of 2 nodes

    test command:

    mpirun -np 16 --hostfile ../../hostfile.txt -bind-to none -map-by slot --display-map --mca pml ob1 --mca btl_vader_single_copy_mechanism none --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 --mca btl_tcp_if_exclude lo,docker0 --mca orte_base_help_aggregate 0 --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 --mca btl openib,self,vader -x NCCL_SOCKET_IFNAME=^lo,docker0 -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/tmp/debug.log.%h.%p -x NCCL_IB_HCA=mlx5_0:1 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_READ=0 ./all_reduce_perf -b 8 -e 128M -f 2
    

    Question: when I switched the env var NCCL_NET_GDR_READ from 0 to 1, nccl-tests showed that the latency was much slower. When NCCL_NET_GDR_READ was 0, the nccl-tests output was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    38.87    0.00    0.00  2e-07    36.96    0.00    0.00  2e-07
              16             4   float     sum    36.45    0.00    0.00  2e-07    36.66    0.00    0.00  1e-07
              32             8   float     sum    36.74    0.00    0.00  1e-07    36.71    0.00    0.00  1e-07
              64            16   float     sum    37.62    0.00    0.00  1e-07    37.03    0.00    0.00  1e-07
             128            32   float     sum    38.05    0.00    0.01  1e-07    38.00    0.00    0.01  1e-07
             256            64   float     sum    38.31    0.01    0.01  6e-08    38.73    0.01    0.01  6e-08
             512           128   float     sum    39.79    0.01    0.02  6e-08    39.00    0.01    0.02  6e-08
            1024           256   float     sum    40.40    0.03    0.05  2e-07    39.96    0.03    0.05  2e-07
            2048           512   float     sum    42.57    0.05    0.09  2e-07    42.42    0.05    0.09  2e-07
            4096          1024   float     sum    73.62    0.06    0.10  5e-07    72.72    0.06    0.11  5e-07
            8192          2048   float     sum    81.68    0.10    0.19  5e-07    80.06    0.10    0.19  5e-07
           16384          4096   float     sum    84.74    0.19    0.36  5e-07    83.30    0.20    0.37  5e-07
           32768          8192   float     sum    90.39    0.36    0.68  5e-07    90.26    0.36    0.68  5e-07
           65536         16384   float     sum    104.2    0.63    1.18  5e-07    102.9    0.64    1.19  5e-07
          131072         32768   float     sum    120.0    1.09    2.05  5e-07    118.6    1.11    2.07  5e-07
          262144         65536   float     sum    218.7    1.20    2.25  5e-07    221.3    1.18    2.22  5e-07
          524288        131072   float     sum    356.1    1.47    2.76  5e-07    355.5    1.47    2.77  5e-07
         1048576        262144   float     sum    479.5    2.19    4.10  5e-07    483.1    2.17    4.07  5e-07
         2097152        524288   float     sum    765.7    2.74    5.14  5e-07    764.2    2.74    5.15  5e-07
         4194304       1048576   float     sum   1428.6    2.94    5.50  5e-07   1425.0    2.94    5.52  5e-07
         8388608       2097152   float     sum   2776.9    3.02    5.66  5e-07   2764.4    3.03    5.69  5e-07
        16777216       4194304   float     sum   5475.1    3.06    5.75  5e-07   5490.5    3.06    5.73  5e-07
        33554432       8388608   float     sum    10886    3.08    5.78  5e-07    10876    3.09    5.78  5e-07
        67108864      16777216   float     sum    37080    1.81    3.39  5e-07    75304    0.89    1.67  5e-07
       134217728      33554432   float     sum    72090    1.86    3.49  5e-07    57255    2.34    4.40  5e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 1.92724
    

    but when NCCL_NET_GDR_READ was 1, the nccl-tests output was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    43.22    0.00    0.00  2e-07    37.00    0.00    0.00  2e-07
              16             4   float     sum    37.34    0.00    0.00  2e-07    37.79    0.00    0.00  1e-07
              32             8   float     sum    37.33    0.00    0.00  1e-07    37.20    0.00    0.00  1e-07
              64            16   float     sum    37.89    0.00    0.00  1e-07    37.73    0.00    0.00  1e-07
             128            32   float     sum    38.61    0.00    0.01  1e-07    38.53    0.00    0.01  1e-07
             256            64   float     sum    43.42    0.01    0.01  6e-08    39.17    0.01    0.01  6e-08
             512           128   float     sum    40.46    0.01    0.02  6e-08    40.32    0.01    0.02  6e-08
            1024           256   float     sum    40.59    0.03    0.05  2e-07    40.28    0.03    0.05  2e-07
            2048           512   float     sum    43.55    0.05    0.09  2e-07    43.05    0.05    0.09  2e-07
            4096          1024   float     sum    73.49    0.06    0.10  5e-07    70.96    0.06    0.11  5e-07
            8192          2048   float     sum    79.89    0.10    0.19  5e-07    79.86    0.10    0.19  5e-07
           16384          4096   float     sum    84.63    0.19    0.36  5e-07    83.82    0.20    0.37  5e-07
           32768          8192   float     sum    93.38    0.35    0.66  5e-07    91.32    0.36    0.67  5e-07
           65536         16384   float     sum    107.4    0.61    1.14  5e-07    104.1    0.63    1.18  5e-07
          131072         32768   float     sum    122.9    1.07    2.00  5e-07    121.7    1.08    2.02  5e-07
          262144         65536   float     sum    225.9    1.16    2.18  5e-07    226.2    1.16    2.17  5e-07
          524288        131072   float     sum    346.8    1.51    2.83  5e-07    345.5    1.52    2.85  5e-07
         1048576        262144   float     sum    428.7    2.45    4.59  5e-07    430.0    2.44    4.57  5e-07
         2097152        524288   float     sum    576.1    3.64    6.83  5e-07    580.9    3.61    6.77  5e-07
         4194304       1048576   float     sum    927.3    4.52    8.48  5e-07    926.1    4.53    8.49  5e-07
         8388608       2097152   float     sum   1678.7    5.00    9.37  5e-07   1683.0    4.98    9.35  5e-07
        16777216       4194304   float     sum   3393.2    4.94    9.27  5e-07   3382.5    4.96    9.30  5e-07
        33554432       8388608   float     sum   7094.9    4.73    8.87  5e-07   7055.8    4.76    8.92  5e-07
        67108864      16777216   float     sum    16353    4.10    7.69  5e-07    16348    4.10    7.70  5e-07
       134217728      33554432   float     sum    32639    4.11    7.71  5e-07    32753    4.10    7.68  5e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 2.89958
    

    If I stop the nv_peer_mem service manually by running the command service nv_peer_mem stop,

    Then run the tests with NCCL_NET_GDR_READ=0, the result was:

                                                         out-of-place                       in-place
           size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
            (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    39.78    0.00    0.00  2e-07    38.16    0.00    0.00  2e-07
              16             4   float     sum    37.00    0.00    0.00  2e-07    37.33    0.00    0.00  1e-07
              32             8   float     sum    37.30    0.00    0.00  1e-07    37.08    0.00    0.00  1e-07
              64            16   float     sum    38.21    0.00    0.00  2e-07    38.90    0.00    0.00  2e-07
             128            32   float     sum    38.55    0.00    0.01  2e-07    38.87    0.00    0.01  2e-07
             256            64   float     sum    39.50    0.01    0.01  2e-07    39.42    0.01    0.01  2e-07
             512           128   float     sum    40.47    0.01    0.02  2e-07    39.91    0.01    0.02  2e-07
            1024           256   float     sum    41.05    0.02    0.05  2e-07    41.08    0.02    0.05  2e-07
            2048           512   float     sum    44.04    0.05    0.09  2e-07    43.84    0.05    0.09  2e-07
            4096          1024   float     sum    48.00    0.09    0.16  2e-07    47.30    0.09    0.16  2e-07
            8192          2048   float     sum    52.58    0.16    0.29  2e-07    51.76    0.16    0.30  2e-07
           16384          4096   float     sum    65.36    0.25    0.47  2e-07    64.10    0.26    0.48  2e-07
           32768          8192   float     sum    90.61    0.36    0.68  2e-07    87.10    0.38    0.71  2e-07
           65536         16384   float     sum    133.1    0.49    0.92  2e-07    258.5    0.25    0.48  2e-07
          131072         32768   float     sum    283.5    0.46    0.87  5e-07    277.1    0.47    0.89  5e-07
          262144         65536   float     sum    307.3    0.85    1.60  5e-07    300.6    0.87    1.63  5e-07
          524288        131072   float     sum    350.6    1.50    2.80  5e-07    353.6    1.48    2.78  5e-07
         1048576        262144   float     sum    475.0    2.21    4.14  5e-07    474.2    2.21    4.15  5e-07
         2097152        524288   float     sum    766.7    2.74    5.13  5e-07    762.5    2.75    5.16  5e-07
         4194304       1048576   float     sum   1453.1    2.89    5.41  5e-07   1451.9    2.89    5.42  5e-07
         8388608       2097152   float     sum   2980.8    2.81    5.28  5e-07   2984.1    2.81    5.27  5e-07
        16777216       4194304   float     sum    71226    0.24    0.44  5e-07   5877.2    2.85    5.35  5e-07
        33554432       8388608   float     sum    12570    2.67    5.01  2e-07    12543    2.68    5.02  2e-07
        67108864      16777216   float     sum    97148    0.69    1.30  2e-07    25695    2.61    4.90  2e-07
       134217728      33554432   float     sum    97671    1.37    2.58  2e-07    69526    1.93    3.62  2e-07
     Out of bounds values : 0 OK
     Avg bus bandwidth    : 1.67461
    

    So this shows that GDR did take effect.

    But the NCCL debug log always shows [0] NCCL INFO Ring 00 : 15[41000] -> 0[1b000] [receive] via NET/IB/0
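For what it's worth, my understanding (worth verifying against your NCCL version) is that NCCL appends /GDRDMA to the transport name in these connection lines when GPU Direct RDMA is actually used for that connection, so the absence of the suffix is what's informative here. A quick grep-style check:

```python
# Sketch: grep-style check of NCCL connection lines. Assumption (verify
# against your NCCL version): NCCL appends "/GDRDMA" to the transport
# when GPU Direct RDMA is active, e.g. "... [receive] via NET/IB/0/GDRDMA".
def uses_gdr(log_line: str) -> bool:
    return "via NET/IB" in log_line and log_line.rstrip().endswith("GDRDMA")

line = "[0] NCCL INFO Ring 00 : 15[41000] -> 0[1b000] [receive] via NET/IB/0"
print(uses_gdr(line))  # → False: this line reports plain IB, no GDR
```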

    opened by weberxie 25
  • GPU occupation during model training

    GPU occupation during model training

    Hi,

    Do you have any profiling results on GPU occupation during training?

    I found that the NCCL communication overhead reached 75%. Is that normal?

    (screenshot: check_gpu_utilization)

    Thanks

    opened by elevenxiang 23
  • peer mapping resources exhausted for < 8 GPUs

    peer mapping resources exhausted for < 8 GPUs

    I am running a NCCL reduction across multiple GPUs on an Amazon p2.16xlarge instance in a multi-process context (one MPI rank per GPU). When I added small arrays together across 16 workers I got the error "peer mapping resources exhausted". Looking online, I determined that perhaps I was limited to 8 GPUs in a group and NCCL wasn't dealing with this limitation internally.

    However, when I reduced between two groups of 8 GPUs using NCCL (by splitting MPI_COMM_WORLD into two separate communicators) and then did a standard MPI reduction in host memory to reduce the remaining two arrays, I got the same error. Same for 7 GPUs. I had to reduce the group size to 4 to get the correct behaviour.

    It seems this is unrelated to the peer ensemble limitation but instead is related to other resources needed for multi-process reductions on a single node.
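The grouping workaround described above can be sketched as follows (plain Python with a hypothetical helper, not an NCCL API): cap each NCCL communicator at the peer-access group size and reduce the partial results across groups in host memory:

```python
# Sketch (hypothetical helper, not NCCL API): partition ranks into
# contiguous groups no larger than the per-node peer-access limit
# (nominally 8, though I had to drop to 4), so that each NCCL
# communicator only spans GPUs that can all peer with each other.
def split_ranks(world_size, max_group=8):
    """Partition ranks [0, world_size) into groups of at most max_group."""
    return [list(range(start, min(start + max_group, world_size)))
            for start in range(0, world_size, max_group)]

groups = split_ranks(16, max_group=8)
print(len(groups), [len(g) for g in groups])  # → 2 [8, 8]
```

Each group then gets its own NCCL communicator (e.g. via an MPI_Comm_split by group index), and the per-group results are combined with a host-memory MPI reduction.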

    Joss Knight

    opened by extabgrad 23
  • NCCL segfaults on single node with 10 GPUs

    NCCL segfaults on single node with 10 GPUs

    I was attempting to use distributed TensorFlow when I noticed I could not add the 10th GPU on my node to a distributed strategy. After running nccl-tests, it appears to be an issue with NCCL.

    [email protected]:~/nccl-tests$  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
    # nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid 226099 on node05-ccncluster device  0 [0x1a] TITAN Xp
    #   Rank  1 Pid 226099 on node05-ccncluster device  1 [0x1b] TITAN Xp
    #   Rank  2 Pid 226099 on node05-ccncluster device  2 [0x1c] TITAN Xp
    #   Rank  3 Pid 226099 on node05-ccncluster device  3 [0x1d] TITAN Xp
    #   Rank  4 Pid 226099 on node05-ccncluster device  4 [0x1e] TITAN Xp
    #   Rank  5 Pid 226099 on node05-ccncluster device  5 [0x3d] TITAN Xp
    #   Rank  6 Pid 226099 on node05-ccncluster device  6 [0x3e] TITAN Xp
    #   Rank  7 Pid 226099 on node05-ccncluster device  7 [0x3f] TITAN Xp
    #
    #                                                     out-of-place                       in-place
    #       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               8             2   float     sum    42.86    0.00    0.00  1e-07    42.51    0.00    0.00  1e-07
              16             4   float     sum    42.46    0.00    0.00  1e-07    43.06    0.00    0.00  1e-07
              32             8   float     sum    42.90    0.00    0.00  6e-08    42.75    0.00    0.00  6e-08
              64            16   float     sum    42.81    0.00    0.00  6e-08    43.06    0.00    0.00  6e-08
             128            32   float     sum    42.81    0.00    0.01  6e-08    42.92    0.00    0.01  6e-08
             256            64   float     sum    43.05    0.01    0.01  3e-08    43.34    0.01    0.01  3e-08
             512           128   float     sum    42.79    0.01    0.02  3e-08    42.65    0.01    0.02  3e-08
            1024           256   float     sum    42.91    0.02    0.04  1e-07    43.00    0.02    0.04  1e-07
            2048           512   float     sum    43.35    0.05    0.08  2e-07    43.25    0.05    0.08  2e-07
            4096          1024   float     sum    43.46    0.09    0.16  2e-07    43.40    0.09    0.17  2e-07
            8192          2048   float     sum    44.38    0.18    0.32  2e-07    43.88    0.19    0.33  2e-07
           16384          4096   float     sum    49.15    0.33    0.58  2e-07    48.86    0.34    0.59  2e-07
           32768          8192   float     sum    72.44    0.45    0.79  2e-07    71.88    0.46    0.80  2e-07
           65536         16384   float     sum    120.5    0.54    0.95  2e-07    121.7    0.54    0.94  2e-07
          131072         32768   float     sum    129.5    1.01    1.77  2e-07    129.5    1.01    1.77  2e-07
          262144         65536   float     sum    157.1    1.67    2.92  2e-07    157.0    1.67    2.92  2e-07
          524288        131072   float     sum    205.4    2.55    4.47  2e-07    205.3    2.55    4.47  2e-07
         1048576        262144   float     sum    305.1    3.44    6.01  2e-07    305.0    3.44    6.02  2e-07
         2097152        524288   float     sum    647.4    3.24    5.67  2e-07    495.1    4.24    7.41  2e-07
         4194304       1048576   float     sum    900.7    4.66    8.15  2e-07    898.9    4.67    8.17  2e-07
         8388608       2097152   float     sum   1735.0    4.83    8.46  2e-07   1718.9    4.88    8.54  2e-07
        16777216       4194304   float     sum   3425.8    4.90    8.57  2e-07   3406.6    4.92    8.62  2e-07
        33554432       8388608   float     sum   6793.3    4.94    8.64  2e-07   6792.5    4.94    8.64  2e-07
        67108864      16777216   float     sum    13579    4.94    8.65  2e-07    13574    4.94    8.65  2e-07
       134217728      33554432   float     sum    27135    4.95    8.66  2e-07    27134    4.95    8.66  2e-07
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 3.0361
    #
    [email protected]:~/nccl-tests$  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
    # nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid 226138 on node05-ccncluster device  0 [0x1a] TITAN Xp
    #   Rank  1 Pid 226138 on node05-ccncluster device  1 [0x1b] TITAN Xp
    #   Rank  2 Pid 226138 on node05-ccncluster device  2 [0x1c] TITAN Xp
    #   Rank  3 Pid 226138 on node05-ccncluster device  3 [0x1d] TITAN Xp
    #   Rank  4 Pid 226138 on node05-ccncluster device  4 [0x1e] TITAN Xp
    #   Rank  5 Pid 226138 on node05-ccncluster device  5 [0x3d] TITAN Xp
    #   Rank  6 Pid 226138 on node05-ccncluster device  6 [0x3e] TITAN Xp
    #   Rank  7 Pid 226138 on node05-ccncluster device  7 [0x3f] TITAN Xp
    #   Rank  8 Pid 226138 on node05-ccncluster device  8 [0x40] TITAN Xp
    #   Rank  9 Pid 226138 on node05-ccncluster device  9 [0x41] TITAN Xp
    Segmentation fault (core dumped)
    [email protected]:~/nccl-tests$
    
    opened by mjlbach 22
  • Infiniband regression in NCCL 2.5.6-2

    Infiniband regression in NCCL 2.5.6-2

    I have an application that uses NCCL via PyTorch 1.4 compiled from source with NCCL 2.5.6-2, and it crashes early with this error message on all/most ranks:

    NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.5.6

    Some of the ranks are printing out this error message:

    NCCL WARN Call to ibv_modify_qp failed with error No such device

    Running the same code on an image that is identical except that it was built with NCCL 2.4.8-1 runs fine. However, we are unable to use that version of NCCL for larger workloads because our nodes have multiple mellanox devices and we invariably run into https://github.com/NVIDIA/nccl/issues/179.

    Additional datapoints:

    • After startup, the application performs one allreduce operation across all ranks to detect issues, which appears to complete successfully despite the warnings.
    • nccl-tests runs successfully and without printing any warnings.

    As a temporary workaround, I would also be interested in backporting the fix for https://github.com/NVIDIA/nccl/issues/179 to NCCL 2.4.8-1, but the referenced commit seems to incorporate a number of different changes. Is there a more minimal diff I could apply to get the fix?

    opened by cswinter 20
  • NCCL Fortran bindings

    NCCL Fortran bindings

    Hi All,

    I've added Fortran bindings for ncclBCast, ncclAllGather and ncclReduce. I plan on adding ncclAllReduce and ncclReduceScatter very soon.

    I've also added tests/samples for using these with floats using both "Fortran Array" and "Pointer" syntax. The tests/samples for ncclAllGather and ncclReduce are for "out-of-place" operation. I plan on adding "in-place" tests/samples soon too.

    Cheers, Kyle

    opened by kylefernandes 20
  • INT32 vs. FP16 performance on NCCL reduction

    INT32 vs. FP16 performance on NCCL reduction

    Hi there,

    I want to ask about the performance comparison between the int32 and fp16 datatypes when using the allreduce API. I am not sure whether it's normal or not, but the int32 latency is almost 6x larger than fp16's. It's kind of weird considering the difference is only 16 bits per element. Could you please give me some insights?
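For scale (my own arithmetic, not a diagnosis): int32 moves exactly twice the bytes of fp16 per element, so message size alone cannot explain a 6x gap; the remainder presumably comes from whatever reduction path the library picks for each dtype.

```python
# Bytes per element for the two dtypes in question.
ITEMSIZE = {"int32": 4, "float16": 2}

def message_bytes(count, dtype):
    """Payload of one allreduce of `count` elements of `dtype`."""
    return count * ITEMSIZE[dtype]

count = 1 << 20  # elements per allreduce (illustrative)
ratio = message_bytes(count, "int32") / message_bytes(count, "float16")
print(ratio)  # → 2.0
```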

    Thanks : )


    opened by minghaoBD 1
  • fix NCCL_DEBUG_FILE behavior

    fix NCCL_DEBUG_FILE behavior

    NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets ncclDebugLevel after parsing NCCL_DEBUG_FILE. Hence, NCCL_DEBUG_FILE will never be parsed.

    This patch moves parsing NCCL_DEBUG and setting tempNcclDebugLevel before parsing NCCL_DEBUG_FILE, and ensures NCCL_DEBUG_FILE is parsed only when NCCL_DEBUG > NCCL_LOG_VERSION (same as the previous behavior, but now checking tempNcclDebugLevel instead of ncclDebugLevel).
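The fixed ordering can be sketched in plain Python (a stand-in for the C initialization logic; names and level values approximate):

```python
# Sketch of the fixed initialization order (names/values approximate):
# resolve NCCL_DEBUG into a temporary level FIRST, then honor
# NCCL_DEBUG_FILE only when the level is above VERSION, and publish
# the level last.
NCCL_LOG_NONE, NCCL_LOG_VERSION, NCCL_LOG_WARN, NCCL_LOG_INFO = 0, 1, 2, 3
LEVELS = {"VERSION": NCCL_LOG_VERSION, "WARN": NCCL_LOG_WARN,
          "INFO": NCCL_LOG_INFO}

def init_debug(env):
    # 1) Parse NCCL_DEBUG into a temporary level first.
    temp_level = LEVELS.get(env.get("NCCL_DEBUG", ""), NCCL_LOG_NONE)
    # 2) Only then parse NCCL_DEBUG_FILE, gated on the temp level
    #    (same condition as before the regression).
    debug_file = None
    if temp_level > NCCL_LOG_VERSION and "NCCL_DEBUG_FILE" in env:
        debug_file = env["NCCL_DEBUG_FILE"]
    # 3) Publish the level last (stands in for writing ncclDebugLevel).
    return temp_level, debug_file

print(init_debug({"NCCL_DEBUG": "INFO", "NCCL_DEBUG_FILE": "/tmp/d.log"}))
# → (3, '/tmp/d.log')
```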

    opened by kingchc 1
  • API for querying group mode status (in/not in group mode)

    API for querying group mode status (in/not in group mode)

    Is it possible to add an API for querying whether we're in group mode? It would come in handy in some scenarios.

    With the latest commit there is only the global variable ncclGroupDepth: https://github.com/NVIDIA/nccl/blob/master/src/group.cc#L13
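A minimal sketch of what such a query could look like (hypothetical names, mirroring the ncclGroupDepth counter mentioned above):

```python
# Sketch (hypothetical API, not NCCL's): track nesting of group calls
# the way NCCL's internal ncclGroupDepth counter does, and expose an
# "are we inside a group?" query on top of it.
class GroupState:
    def __init__(self):
        self.depth = 0  # analogue of ncclGroupDepth in src/group.cc

    def group_start(self):   # analogue of ncclGroupStart
        self.depth += 1

    def group_end(self):     # analogue of ncclGroupEnd
        assert self.depth > 0, "unbalanced group_end"
        self.depth -= 1

    def in_group(self):      # the requested query
        return self.depth > 0

g = GroupState()
g.group_start()
print(g.in_group())  # → True
g.group_end()
print(g.in_group())  # → False
```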

    Thanks!

    opened by sitecao 0
  • RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms

    RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms

    Describe the bug

    Hi, @espnet team, thanks for the amazing work. I am running the librispeech recipe in distributed mode using Slurm on espnet2. I am running on two Oracle instances, each with a single GPU (Tesla V100). When I ran stage 11 it created jobs on both machines and GPU memory was utilized, but it failed after some time.

    Basic environments:

    • OS information: Ubuntu 18.04 x86_64
    • python version: python 3.9 [GCC 7.3.0]
    • espnet version: latest
    • pytorch version 1.12.0
    • cuda 10.2

    Task information:

    • Task: ASR
    • Recipe: librispeech
    • ESPnet2

    To Reproduce

    When I ran stage 11 with Slurm, it showed an error after some time...

    slurm.conf:

    # Default configuration
    command sbatch --export=PATH
    option name=* --job-name $0
    option time=* --time $0
    option mem=* --mem-per-cpu $0
    option mem=0
    option num_threads=* --cpus-per-task $0 --ntasks-per-node=1
    option num_threads=1 --cpus-per-task 12 --ntasks-per-node=1
    option num_nodes=* --nodes $0
    option gpu=1 -p tgpu
    option gpu=* -p tgpu --gres=gpu:$0 -c $0  # Recommend allocating more CPU than, or equal to the number of GPU
    # note: the --max-jobs-run option is supported as a special case
    # by slurm.pl and you don't have to handle it in the config file.
    # default cpu=1

    $ sinfo
    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    tgpu*        up  infinite     2  idle hp-[1-2]

    $ scontrol show nodes
    NodeName=hp-1 Arch=x86_64 CoresPerSocket=1
    CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.34
    AvailableFeatures=(null) ActiveFeatures=(null)
    Gres=gpu:1
    NodeAddr=hp-1 NodeHostName=hp-1 Version=17.11
    OS=Linux 5.4.0-1079-oracle #87~18.04.1-Ubuntu SMP Mon Jul 11 03:41:03 UTC 2022
    RealMemory=1 AllocMem=0 FreeMem=86991 Sockets=12 Boards=1
    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=tgpu
    BootTime=2022-07-24T06:57:55 SlurmdStartTime=2022-07-24T10:10:49
    CfgTRES=cpu=12,mem=1M,billing=12 AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    NodeName=hp-2 Arch=x86_64 CoresPerSocket=1
    CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.09
    AvailableFeatures=(null) ActiveFeatures=(null)
    Gres=gpu:1
    NodeAddr=hp-2 NodeHostName=hp-2 Version=17.11
    OS=Linux 5.4.0-1079-oracle #87~18.04.1-Ubuntu SMP Mon Jul 11 03:41:03 UTC 2022
    RealMemory=1 AllocMem=0 FreeMem=86953 Sockets=12 Boards=1
    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=tgpu
    BootTime=2022-07-24T07:00:18 SlurmdStartTime=2022-07-24T10:15:26
    CfgTRES=cpu=12,mem=1M,billing=12 AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    GPU utilization (screenshot, 2022-07-23)

    Error logs:

    #Running on hp-1
    #Started at Sat Jul 23 17:17:24 UTC 2022
    #SLURMD_NODENAME=hp-1
    #SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
    #SLURM_CLUSTER_NAME=cluster
    #SLURM_CPUS_ON_NODE=12
    #SLURM_CPUS_PER_TASK=12
    #SLURM_EXPORT_ENV=PATH
    #SLURM_GET_USER_ENV=1
    #SLURM_GTIDS=0
    #SLURM_JOBID=70
    #SLURM_JOB_CPUS_PER_NODE='12(x2)'
    #SLURM_JOB_GID=1001
    #SLURM_JOB_ID=70
    #SLURM_JOB_NAME=test
    #SLURM_JOB_NODELIST='hp-[1-2]'
    #SLURM_JOB_NUM_NODES=2
    #SLURM_JOB_PARTITION=tgpu
    #SLURM_JOB_UID=1001
    #SLURM_JOB_USER=ubuntu
    #SLURM_LOCALID=0
    #SLURM_NNODES=2
    #SLURM_NODEID=0
    #SLURM_NODELIST='hp-[1-2]'
    #SLURM_NODE_ALIASES='(null)'
    #SLURM_NPROCS=2
    #SLURM_NTASKS=2
    #SLURM_NTASKS_PER_NODE=1
    #SLURM_OPEN_MODE=a
    #SLURM_PRIO_PROCESS=0
    #SLURM_PROCID=0
    #SLURM_SUBMIT_DIR=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1
    #SLURM_SUBMIT_HOST=hp-1
    #SLURM_TASKS_PER_NODE='1(x2)'
    #SLURM_TASK_PID=28524
    #SLURM_TOPOLOGY_ADDR=hp-1
    #SLURM_TOPOLOGY_ADDR_PATTERN=node
    #SLURM_WORKING_CLUSTER=cluster:155.248.167.102:6817:8192
    #srun --export=ALL
    srun -N2 python3 -m espnet2.bin.asr_train --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_+SJraOwsjSi9F2aB --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_a71d8596-0515-49d8-8cff-e85faece2c90 /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_+SJraOwsjSi9F2aB --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_a71d8596-0515-49d8-8cff-e85faece2c90 /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_+SJraOwsjSi9F2aB 
--use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method 
file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_a71d8596-0515-49d8-8cff-e85faece2c90 /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_+SJraOwsjSi9F2aB --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_a71d8596-0515-49d8-8cff-e85faece2c90 /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_+SJraOwsjSi9F2aB --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_a71d8596-0515-49d8-8cff-e85faece2c90 WARNING:root:Using legacy_rel_pos and it will be deprecated in the future. WARNING:root:Using legacy_rel_pos and it will be deprecated in the future. WARNING:root:Using legacy_rel_pos and it will be deprecated in the future. WARNING:root:Using legacy_rel_pos and it will be deprecated in the future. WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future. WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future. WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future. WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future. hp-1:28603:28603 [0] NCCL INFO Bootstrap : Using ens3:10.0.0.27<0> hp-1:28603:28603 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

    hp-1:28603:28603 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1] hp-1:28603:28603 [0] NCCL INFO NET/Socket : Using [0]ens3:10.0.0.27<0> hp-1:28603:28603 [0] NCCL INFO Using network Socket NCCL version 2.10.3+cuda10.2 hp-1:28608:28608 [0] NCCL INFO Bootstrap : Using ens3:10.0.0.27<0> hp-1:28608:28608 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

    hp-1:28608:28608 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1] hp-1:28608:28608 [0] NCCL INFO NET/Socket : Using [0]ens3:10.0.0.27<0> hp-1:28608:28608 [0] NCCL INFO Using network Socket NCCL version 2.10.3+cuda10.2 Traceback (most recent call last): File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 197, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 87, in _run_code exec(code, run_globals) File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 23, in main() File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 19, in main ASRTask.main(cmd=cmd) File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1013, in main cls.main_worker(args) File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1309, in main_worker cls.trainer.run( File "/home/ubuntu/users/himanshu/espnet/espnet2/train/trainer.py", line 220, in run dp_model = torch.nn.parallel.DistributedDataParallel( File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 646, in init _verify_param_shape_across_processes(self.process_group, parameters) File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes return dist._verify_params_across_processes(process_group, tensors, logger) RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms Exception raised from get at ../torch/csrc/distributed/c10d/FileStore.cpp:362 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f03a5bba612 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: 
c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f03a5bb6cab in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame #2: c10d::FileStore::get(std::string const&) + 0xb09 (0x7f03da1ce739 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so) frame #3: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f03da1d13c2 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so) frame #4: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f03da1d13c2 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so) frame #5: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xb1 (0x7f03a6ffa301 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #6: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocatorc10::Device > const&, c10d::OpType, int, bool) + 0x204 (0x7f03a6ffe794 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #7: c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > >&, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllgatherOptions const&) + 0x34b (0x7f03a700c7db in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #8: c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, std::vector<at::Tensor, std::allocatorat::Tensor > const&, c10::optional<std::weak_ptrc10d::Logger > const&) + 0x3f5 (0x7f03da21b825 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so) frame #9: + 0x87cebc (0x7f03ef97debc in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so) frame #10: + 0x21ebc5 (0x7f03ef31fbc5 
in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so) frame #11: + 0x1828f4 (0x55f7867078f4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #12: _PyObject_MakeTpCall + 0x2df (0x55f7866c147f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #13: _PyEval_EvalFrameDefault + 0x49a9 (0x55f78675f2e9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #14: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #15: _PyFunction_Vectorcall + 0x1d4 (0x55f78671ccb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #16: + 0xfe088 (0x55f786683088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #17: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #18: _PyFunction_Vectorcall + 0x244 (0x55f78671cd24 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #19: _PyObject_FastCallDictTstate + 0xee (0x55f786707a2e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #20: + 0x18c429 (0x55f786711429 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #21: _PyObject_MakeTpCall + 0x38f (0x55f7866c152f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #22: _PyEval_EvalFrameDefault + 0x1350 (0x55f78675bc90 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #23: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #24: + 0x198709 (0x55f78671d709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #25: + 0xfe73d (0x55f78668373d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #26: + 0x198559 (0x55f78671d559 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #27: + 0xff300 (0x55f786684300 in 
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #28: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #29: + 0x198709 (0x55f78671d709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #30: + 0xfe73d (0x55f78668373d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #31: + 0x231418 (0x55f7867b6418 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #32: + 0xfe088 (0x55f786683088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #33: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #34: PyEval_EvalCodeEx + 0x4c (0x55f7867c8a7c in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #35: PyEval_EvalCode + 0x1b (0x55f78671cdbb in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #36: + 0x27a33e (0x55f7867ff33e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #37: + 0x1a1571 (0x55f786726571 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #38: + 0xfe088 (0x55f786683088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #39: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #40: _PyFunction_Vectorcall + 0x1d4 (0x55f78671ccb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #41: + 0xfe088 (0x55f786683088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #42: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #43: _PyFunction_Vectorcall + 0x1d4 (0x55f78671ccb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #44: _PyObject_Call + 0x1da (0x55f7866cb30a in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #45: + 0x274eaa 
(0x55f7867f9eaa in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #46: Py_RunMain + 0x18f (0x55f7867fec0f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #47: Py_BytesMain + 0x39 (0x55f7867feff9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #48: __libc_start_main + 0xe7 (0x7f0416b22c87 in /lib/x86_64-linux-gnu/libc.so.6) frame #49: + 0x2016a0 (0x55f7867866a0 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)

    Traceback (most recent call last): File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 197, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 87, in _run_code exec(code, run_globals) File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 23, in main() File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 19, in main ASRTask.main(cmd=cmd) File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1013, in main cls.main_worker(args) File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1309, in main_worker cls.trainer.run( File "/home/ubuntu/users/himanshu/espnet/espnet2/train/trainer.py", line 220, in run dp_model = torch.nn.parallel.DistributedDataParallel( File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 646, in init _verify_param_shape_across_processes(self.process_group, parameters) File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes return dist._verify_params_across_processes(process_group, tensors, logger) RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms Exception raised from get at ../torch/csrc/distributed/c10d/FileStore.cpp:362 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa47e37a612 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fa47e376cab in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libc10.so) frame #2: c10d::FileStore::get(std::string const&) + 0xb09 
(0x7fa4b298e739 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so) frame #3: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fa4b29913c2 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so) frame #4: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fa4b29913c2 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so) frame #5: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xb1 (0x7fa47f7ba301 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #6: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocatorc10::Device > const&, c10d::OpType, int, bool) + 0x204 (0x7fa47f7be794 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #7: c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > >&, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllgatherOptions const&) + 0x34b (0x7fa47f7cc7db in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #8: c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, std::vector<at::Tensor, std::allocatorat::Tensor > const&, c10::optional<std::weak_ptrc10d::Logger > const&) + 0x3f5 (0x7fa4b29db825 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so) frame #9: + 0x87cebc (0x7fa4c813debc in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so) frame #10: + 0x21ebc5 (0x7fa4c7adfbc5 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so) frame #11: + 0x1828f4 (0x559e091508f4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #12: _PyObject_MakeTpCall + 0x2df 
(0x559e0910a47f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #13: _PyEval_EvalFrameDefault + 0x49a9 (0x559e091a82e9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #14: + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #15: _PyFunction_Vectorcall + 0x1d4 (0x559e09165cb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #16: + 0xfe088 (0x559e090cc088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #17: + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #18: _PyFunction_Vectorcall + 0x244 (0x559e09165d24 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #19: _PyObject_FastCallDictTstate + 0xee (0x559e09150a2e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #20: + 0x18c429 (0x559e0915a429 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #21: _PyObject_MakeTpCall + 0x38f (0x559e0910a52f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #22: _PyEval_EvalFrameDefault + 0x1350 (0x559e091a4c90 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #23: + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #24: + 0x198709 (0x559e09166709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #25: + 0xfe73d (0x559e090cc73d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #26: + 0x198559 (0x559e09166559 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #27: + 0xff300 (0x559e090cd300 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #28: + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #29: + 0x198709 (0x559e09166709 in 
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #30: + 0xfe73d (0x559e090cc73d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #31: + 0x231418 (0x559e091ff418 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #32: + 0xfe088 (0x559e090cc088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #33: + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #34: PyEval_EvalCodeEx + 0x4c (0x559e09211a7c in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #35: PyEval_EvalCode + 0x1b (0x559e09165dbb in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #36: + 0x27a33e (0x559e0924833e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #37: + 0x1a1571 (0x559e0916f571 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #38: + 0xfe088 (0x559e090cc088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #39: + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #40: _PyFunction_Vectorcall + 0x1d4 (0x559e09165cb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #41: + 0xfe088 (0x559e090cc088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #42: + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #43: _PyFunction_Vectorcall + 0x1d4 (0x559e09165cb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #44: _PyObject_Call + 0x1da (0x559e0911430a in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #45: + 0x274eaa (0x559e09242eaa in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #46: Py_RunMain + 0x18f (0x559e09247c0f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #47: 
Py_BytesMain + 0x39 (0x559e09247ff9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3) frame #48: __libc_start_main + 0xe7 (0x7fa4ef2e2c87 in /lib/x86_64-linux-gnu/libc.so.6) frame #49: + 0x2016a0 (0x559e091cf6a0 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)

    srun: error: hp-2: task 1: Exited with exit code 1 srun: error: hp-2: task 1: Exited with exit code 1 srun: Job step aborted: Waiting up to 32 seconds for job step to finish. srun: Job step aborted: Waiting up to 32 seconds for job step to finish. srun: got SIGCONT srun: forcing job termination srun: got SIGCONT srun: Job step aborted: Waiting up to 32 seconds for job step to finish. srun: got SIGCONT slurmstepd-hp-1: error: *** STEP 70.2 ON hp-1 CANCELLED AT 2022-07-23T18:02:02 *** srun: forcing job termination slurmstepd-hp-1: error: *** STEP 70.1 ON hp-1 CANCELLED AT 2022-07-23T18:02:02 *** slurmstepd-hp-1: error: *** STEP 70.0 ON hp-1 CANCELLED AT 2022-07-23T18:02:02 *** srun: forcing job termination
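    The `Timeout waiting for key: default_pg/0/0` above comes from the c10d FileStore rendezvous: rank 0 writes the ncclUniqueId under a key in the `--dist_init_method file://...` path, and every other rank polls that same file for up to 30 minutes. A toy sketch (plain Python; `store_set`/`store_get` are hypothetical stand-ins, not the real c10d API) illustrates why the poll can never succeed when the init file lives on node-local rather than shared storage:

    ```python
    import os
    import tempfile
    import time

    def store_set(path, key, value):
        # Rank 0 publishes a value (the ncclUniqueId, in DDP) under a key.
        with open(os.path.join(path, key), "w") as f:
            f.write(value)

    def store_get(path, key, timeout_s=1.0, poll_s=0.05):
        # Other ranks poll for the key until a deadline, like FileStore::get.
        deadline = time.monotonic() + timeout_s
        target = os.path.join(path, key)
        while time.monotonic() < deadline:
            if os.path.exists(target):
                with open(target) as f:
                    return f.read()
            time.sleep(poll_s)
        raise TimeoutError(f"Timeout waiting for key: {key}")

    shared = tempfile.mkdtemp()   # a directory both "ranks" can see
    store_set(shared, "0", "ncclUniqueId")
    print(store_get(shared, "0"))  # rendezvous succeeds

    local = tempfile.mkdtemp()    # a second, unshared directory: the other rank's view
    try:
        store_get(local, "0", timeout_s=0.2)
    except TimeoutError as e:
        print(e)                  # the failure mode seen in the log, in miniature
    ```

    If the `file://` init path is not on a filesystem mounted identically on hp-1 and hp-2, each node ends up polling its own local copy and the ranks never meet; a path on shared storage (or a TCP-based init method) avoids this.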

    opened by himanshucodz55 (1 comment)
  • A100 GDRDMA failed. transport/net_ib.cc:818 Got completion with error 4, opcode 8983, len 0, vendor err 81


    I tried to run the following command to test GDRDMA locally, in order to use it for multi-node training later, but it failed. Could you please give me some suggestions?

    version_info: NCCL version 2.7.6+cuda11.1

    NCCL_IB_HCA=mlx5 CUDA_VISIBLE_DEVICES=4,5,6,7 NCCL_DEBUG=INFO NCCL_NET_GDR_LEVEL=4 NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 NCCL_CROSS_NIC=1 ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 4
    

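    One thing worth ruling out first (a hedged troubleshooting sketch, not part of the original report): GPUDirect RDMA requires the `nv_peer_mem` kernel module (or the newer `nvidia-peermem` module shipped with recent drivers) to be loaded on the host; without it, the IB transport cannot register GPU memory. A quick host-side check, reading the loaded-module list:

    ```python
    def gdr_module_loaded(modules_file="/proc/modules"):
        """Return True if a GPUDirect RDMA peer-memory module appears loaded."""
        try:
            with open(modules_file) as f:
                # Each /proc/modules line starts with the module name.
                return any(line.split()[0] in ("nv_peer_mem", "nvidia_peermem")
                           for line in f if line.strip())
        except OSError:
            # No /proc/modules (or unreadable): treat as not loaded.
            return False

    print("GPUDirect RDMA peer-memory module loaded:", gdr_module_loaded())
    ```

    The same check is usually done with `lsmod | grep peer_mem`; if the module is missing, NCCL either falls back to staging through host memory or, depending on settings, fails on the IB path.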
    The topology is shown below; for this test case we use GPUs 4 to 7 and mlx5_8 through mlx5_15.

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  mlx5_6  mlx5_7  mlx5_8  mlx5_9  mlx5_10 mlx5_11 mlx5_12 mlx5_13 mlx5_14 mlx5_15 mlx5_16 mlx5_17 CPU Affinity   NUMA Affinity
    GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    NODE    NODE    NODE    NODE    PXB     PXB     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143    0
    GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    NODE    NODE    NODE    NODE    PXB     PXB     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143    0
    GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PXB     PXB     PXB     PXB     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143    0
    GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PXB     PXB     PXB     PXB     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143    0
    GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB     PXB     PXB     PXB     NODE    NODE    48-95,144-191  1
    GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB     PXB     PXB     PXB     NODE    NODE    48-95,144-191  1
    GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     PXB     PXB     NODE    NODE    NODE    NODE    NODE    NODE    48-95,144-191  1
    GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     PXB     PXB     NODE    NODE    NODE    NODE    NODE    NODE    48-95,144-191  1
    mlx5_0  NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS      X      PIX     PIX     PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
    mlx5_1  NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     PIX      X      PIX     PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
    mlx5_2  NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     PIX     PIX      X      PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
    mlx5_3  NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     PIX     PIX     PIX      X      NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
    mlx5_4  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X      PIX     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
    mlx5_5  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX      X      PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
    mlx5_6  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB     PXB      X      PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
    mlx5_7  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB     PXB     PIX      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
    mlx5_8  SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     PXB     PXB     NODE    NODE    NODE    NODE    NODE    NODE
    mlx5_9  SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      PXB     PXB     NODE    NODE    NODE    NODE    NODE    NODE
    mlx5_10 SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB      X      PIX     NODE    NODE    NODE    NODE    NODE    NODE
    mlx5_11 SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     PIX      X      NODE    NODE    NODE    NODE    NODE    NODE
    mlx5_12 SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X      PIX     PXB     PXB     NODE    NODE
    mlx5_13 SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX      X      PXB     PXB     NODE    NODE
    mlx5_14 SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB     PXB      X      PIX     NODE    NODE
    mlx5_15 SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB     PXB     PIX      X      NODE    NODE
    mlx5_16 SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE     X      PIX
    mlx5_17 SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    PIX      X
    
    

    Here is the NCCL log we got.

    # nThread 1 nGpus 4 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
    #
    # Using devices
    #   Rank  0 Pid 120890 on 10-90-43-152 device  0 [0x87] A100-SXM4-40GB
    #   Rank  1 Pid 120890 on 10-90-43-152 device  1 [0x8d] A100-SXM4-40GB
    #   Rank  2 Pid 120890 on 10-90-43-152 device  2 [0xc7] A100-SXM4-40GB
    #   Rank  3 Pid 120890 on 10-90-43-152 device  3 [0xca] A100-SXM4-40GB
    10-90-43-152:120890:120890 [0] NCCL INFO Bootstrap : Using [0]ib8:192.168.1.8<0> [1]ib9:192.168.1.9<0> [2]ib10:192.168.1.10<0> [3]ib11:192.168.1.11<0> [4]ib12:192.168.1.12<0> [5]ib14:192.168.1.14<0> [6]ib15:192.168.1.15<0>
    10-90-43-152:120890:120890 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    10-90-43-152:120890:120890 [0] NCCL INFO NET/IB : Using [0]mlx5_8:1/IB [1]mlx5_9:1/IB [2]mlx5_10:1/IB [3]mlx5_11:1/IB [4]mlx5_12:1/IB [5]mlx5_14:1/IB [6]mlx5_15:1/IB [7]mlx5_16:1/RoCE [8]mlx5_17:1/RoCE ; OOB ib8:192.168.1.8<0>
    10-90-43-152:120890:120890 [0] NCCL INFO Using network IB
    NCCL version 2.7.6+cuda11.1
    10-90-43-152:120890:121560 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
    10-90-43-152:120890:121560 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
    10-90-43-152:120890:121560 [0] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
    10-90-43-152:120890:121560 [0] NCCL INFO NCCL_CROSS_NIC set by environment to 1.
    10-90-43-152:120890:121563 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
    10-90-43-152:120890:121563 [3] NCCL INFO Trees [0] -1/-1/-1->3->2|2->3->-1/-1/-1 [1] -1/-1/-1->3->2|2->3->-1/-1/-1 [2] 2/0/-1->3->1|1->3->2/0/-1 [3] 2/0/-1->3->1|1->3->2/0/-1
    10-90-43-152:120890:121563 [3] NCCL INFO Setting affinity for GPU 7 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 00/04 :    0   1   2   3
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 01/04 :    0   1   2   3
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 02/04 :    0   1   2   3
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 03/04 :    0   1   2   3
    10-90-43-152:120890:121561 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
    10-90-43-152:120890:121562 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
    10-90-43-152:120890:121562 [2] NCCL INFO Trees [0] 1/3/-1->2->0|0->2->1/3/-1 [1] 1/3/-1->2->0|0->2->1/3/-1 [2] -1/-1/-1->2->3|3->2->-1/-1/-1 [3] -1/-1/-1->2->3|3->2->-1/-1/-1
    10-90-43-152:120890:121561 [1] NCCL INFO Trees [0] -1/-1/-1->1->2|2->1->-1/-1/-1 [1] -1/-1/-1->1->2|2->1->-1/-1/-1 [2] 3/-1/-1->1->-1|-1->1->3/-1/-1 [3] 3/-1/-1->1->-1|-1->1->3/-1/-1
    10-90-43-152:120890:121561 [1] NCCL INFO Setting affinity for GPU 5 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
    10-90-43-152:120890:121562 [2] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
    10-90-43-152:120890:121560 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
    10-90-43-152:120890:121560 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1|-1->0->2/-1/-1 [1] 2/-1/-1->0->-1|-1->0->2/-1/-1 [2] -1/-1/-1->0->3|3->0->-1/-1/-1 [3] -1/-1/-1->0->3|3->0->-1/-1/-1
    10-90-43-152:120890:121560 [0] NCCL INFO Setting affinity for GPU 4 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 00 : 2[c7000] -> 3[ca000] [receive] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 00 : 1[8d000] -> 2[c7000] [receive] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 00 : 0[87000] -> 1[8d000] [receive] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 00 : 3[ca000] -> 0[87000] [receive] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 00 : 3[ca000] -> 0[87000] [send] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 00 : 2[c7000] -> 3[ca000] [send] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 00 : 1[8d000] -> 2[c7000] [send] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 00 : 0[87000] -> 1[8d000] [send] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 00 : 2[c7000] -> 1[8d000] [receive] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 00 : 2[c7000] -> 0[87000] [receive] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 00 : 3[ca000] -> 2[c7000] [receive] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 00 : 2[c7000] -> 0[87000] [send] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 00 : 3[ca000] -> 2[c7000] [send] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 00 : 0[87000] -> 2[c7000] [send] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 01 : 2[c7000] -> 3[ca000] [receive] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 01 : 3[ca000] -> 0[87000] [send] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 00 : 0[87000] -> 2[c7000] [receive] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 00 : 2[c7000] -> 1[8d000] [send] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 01 : 3[ca000] -> 0[87000] [receive] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 01 : 0[87000] -> 1[8d000] [send] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 01 : 0[87000] -> 1[8d000] [receive] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 01 : 1[8d000] -> 2[c7000] [send] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 01 : 1[8d000] -> 2[c7000] [receive] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 01 : 2[c7000] -> 3[ca000] [send] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 01 : 2[c7000] -> 0[87000] [receive] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 01 : 2[c7000] -> 1[8d000] [receive] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 01 : 3[ca000] -> 2[c7000] [send] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 01 : 3[ca000] -> 2[c7000] [receive] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 01 : 2[c7000] -> 0[87000] [send] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 01 : 0[87000] -> 2[c7000] [send] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 02 : 2[c7000] -> 3[ca000] [receive] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 01 : 0[87000] -> 2[c7000] [receive] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 02 : 3[ca000] -> 0[87000] [send] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 01 : 2[c7000] -> 1[8d000] [send] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 02 : 3[ca000] -> 0[87000] [receive] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 02 : 0[87000] -> 1[8d000] [send] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 02 : 0[87000] -> 1[8d000] [receive] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 02 : 1[8d000] -> 2[c7000] [send] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 02 : 1[8d000] -> 2[c7000] [receive] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 02 : 2[c7000] -> 3[ca000] [send] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 02 : 0[87000] -> 3[ca000] [receive] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 02 : 3[ca000] -> 1[8d000] [send] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 02 : 3[ca000] -> 1[8d000] [receive] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 02 : 3[ca000] -> 2[c7000] [receive] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 02 : 0[87000] -> 3[ca000] [send] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 03 : 3[ca000] -> 0[87000] [receive] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 03 : 0[87000] -> 1[8d000] [send] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 02 : 1[8d000] -> 3[ca000] [send] via NET/IB/4/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 02 : 1[8d000] -> 3[ca000] [receive] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 02 : 3[ca000] -> 2[c7000] [send] via NET/IB/0/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 03 : 0[87000] -> 1[8d000] [receive] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 03 : 1[8d000] -> 2[c7000] [send] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 03 : 1[8d000] -> 2[c7000] [receive] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 03 : 2[c7000] -> 3[ca000] [receive] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 03 : 2[c7000] -> 3[ca000] [send] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 03 : 3[ca000] -> 0[87000] [send] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 03 : 3[ca000] -> 1[8d000] [receive] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO Channel 03 : 0[87000] -> 3[ca000] [send] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 03 : 0[87000] -> 3[ca000] [receive] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO Channel 03 : 3[ca000] -> 2[c7000] [receive] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 03 : 3[ca000] -> 1[8d000] [send] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121560 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
    10-90-43-152:120890:121560 [0] NCCL INFO comm 0x7ff380b2bd60 rank 0 nranks 4 cudaDev 0 busId 87000 - Init COMPLETE
    10-90-43-152:120890:121561 [1] NCCL INFO Channel 03 : 1[8d000] -> 3[ca000] [send] via NET/IB/5/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 03 : 1[8d000] -> 3[ca000] [receive] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121563 [3] NCCL INFO Channel 03 : 3[ca000] -> 2[c7000] [send] via NET/IB/2/GDRDMA
    10-90-43-152:120890:121562 [2] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
    10-90-43-152:120890:121561 [1] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
    10-90-43-152:120890:121562 [2] NCCL INFO comm 0x7ff378b2bdd0 rank 2 nranks 4 cudaDev 2 busId c7000 - Init COMPLETE
    10-90-43-152:120890:121561 [1] NCCL INFO comm 0x7ff384b2bd60 rank 1 nranks 4 cudaDev 1 busId 8d000 - Init COMPLETE
    10-90-43-152:120890:121563 [3] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
    10-90-43-152:120890:121563 [3] NCCL INFO comm 0x7ff38c000dc0 rank 3 nranks 4 cudaDev 3 busId ca000 - Init COMPLETE
    #
    #                                                       out-of-place                       in-place
    #       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    #        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    10-90-43-152:120890:120890 [0] NCCL INFO Launch mode Group/CGMD
    
    10-90-43-152:120890:121587 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 4, opcode 0, len 32759, vendor err 81
    10-90-43-152:120890:121587 [0] NCCL INFO include/net.h:28 -> 2
    10-90-43-152:120890:121587 [0] NCCL INFO transport/net.cc:317 -> 2
    10-90-43-152:120890:121587 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
    
    10-90-43-152:120890:121585 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 4, opcode 21394, len 0, vendor err 81
    10-90-43-152:120890:121585 [0] NCCL INFO include/net.h:28 -> 2
    10-90-43-152:120890:121585 [0] NCCL INFO transport/net.cc:317 -> 2
    10-90-43-152:120890:121585 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
    
    10-90-43-152:120890:121586 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 4, opcode 9022, len 0, vendor err 81
    10-90-43-152:120890:121586 [0] NCCL INFO include/net.h:28 -> 2
    10-90-43-152:120890:121586 [0] NCCL INFO transport/net.cc:317 -> 2
    10-90-43-152:120890:121586 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
    10-90-43-152: Test NCCL failure common.cu:499 'unhandled system error'
     .. 10-90-43-152 pid 120890: Test failure common.cu:587
     .. 10-90-43-152 pid 120890: Test failure common.cu:766
     .. 10-90-43-152 pid 120890: Test failure all_reduce.cu:103
     .. 10-90-43-152 pid 120890: Test failure common.cu:792
     .. 10-90-43-152 pid 120890: Test failure common.cu:1166
     .. 10-90-43-152 pid 120890: Test failure common.cu:1007
    
    
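Completion error 4 corresponds to IBV_WC_LOC_PROT_ERR in the libibverbs status enum, which usually points at a memory-registration or protection problem on the IB path. One way to narrow it down is to re-run the benchmark while dropping one of the forced NCCL settings at a time. A minimal sketch that only prints the candidate commands (the benchmark invocation itself is not executed here; variable names are for illustration):

```shell
# Sketch: bisect which forced setting triggers the IB completion error by
# re-running with one variable removed at a time. This only prints the
# commands that would be run; it does not launch the benchmark.
BASE="NCCL_IB_HCA=mlx5 CUDA_VISIBLE_DEVICES=4,5,6,7 NCCL_DEBUG=INFO"
FORCED="NCCL_NET_GDR_LEVEL=4 NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 NCCL_CROSS_NIC=1"
count=0
for drop in $FORCED; do
  vars=""
  for v in $FORCED; do
    [ "$v" = "$drop" ] || vars="$vars $v"
  done
  echo "env $BASE$vars ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 4   # without $drop"
  count=$((count + 1))
done
```

Each printed line omits exactly one of the four forced settings, so the run that stops failing identifies the variable involved.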
    opened by ConnollyLeon 1
  • NCCL hangs after multiple training steps due to network connection

    NCCL hangs after multiple training steps due to network connection

    When running Horovod with NCCL on a Kubernetes cluster of 32 pods, each with 6 V100 GPUs, NCCL threads hang in persistentSocketThread after a recv call fails:

    [15]worker-23:83:2526 [0] misc/socket.cc:503 NCCL WARN Net : Call to recv from 100.96.188.199<41468> failed : Connection timed out
    [15]worker-23:83:2526 [0] NCCL INFO misc/socket.cc:520 -> 2
    [15]worker-23:83:2526 [0] transport/net_socket.cc:219 NCCL WARN NET/Socket : socket progress error
    [15]worker-23:83:2520 [0] NCCL INFO include/net.h:32 -> 2
    [15]worker-23:83:2520 [0] NCCL INFO transport/net.cc:870 -> 2
    [15]worker-23:83:2520 [0] NCCL INFO proxy.cc:494 -> 2
    [15]worker-23:83:2520 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread]
    

    All of the ranks have this stack trace:

    #0  0x00007f7a2620c965 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
    #1  0x00007f795f0e179b in persistentSocketThread (args_=0x7f61ac040c10) at transport/net_socket.cc:231
    #2  0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
    #3  0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6
    

    and the recv trace:

    #0  0x00007f7a2620fa8b in recv () from /usr/lib64/libpthread.so.0
    #1  0x00007f795f0d6f40 in ncclSocketProgressOpt (block=0, closed=<synthetic pointer>, offset=0x7f6e30ff8cc0, size=4, ptr=0x7f6e30ff8cb0, sock=0x7f61ac02f670, op=1) at misc/socket.cc:495
    #2  ncclSocketProgress (op=1, sock=0x7f61ac02f670, ptr=ptr@entry=0x7f6e30ff8cb0, size=size@entry=4, offset=offset@entry=0x7f6e30ff8cc0) at misc/socket.cc:520
    #3  0x00007f795f0e388f in ncclSocketTest (request=0x7f61ac0302b8, done=0x7f6e30ff91b0, size=0x7f6e30ff9230) at transport/net_socket.cc:493
    #4  0x00007f795f0ddf2d in ncclNetTest (sizes=0x7f6e30ff9230, done=0x7f6e30ff91b0, request=<optimized out>) at include/net.h:32
    #5  recvProxyProgress (comm=<optimized out>, args=0x7f61a4037fc8) at transport/net.cc:996
    #6  0x00007f795f0cc933 in progressOps (idle=<synthetic pointer>, opStart=<optimized out>, state=0x7f795825b4b8, comm=0x7f7958256ff0) at proxy.cc:494
    #7  ncclProxyProgress (comm_=0x7f7958256ff0) at proxy.cc:611
    #8  0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
    #9  0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6
    
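For reference, backtraces like the two above can be captured from a hung rank with gdb's batch mode. A sketch that only assembles the command (PID 83 is taken from the "worker-23:83:..." log prefix; gdb must be available inside the pod):

```shell
# Build the gdb batch command that dumps every thread's backtrace from a
# hung NCCL process. Only the command string is printed here, not executed.
PID=83   # process id from the "worker-23:83:..." log prefix
GDB_CMD="gdb -p $PID -batch -ex 'set pagination off' -ex 'thread apply all bt'"
echo "$GDB_CMD"
```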

    Meanwhile, GPU utilization and GPU memory usage stay near 100% even though no training is making progress:

    $ nvidia-smi
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
    | N/A   48C    P0    47W / 250W |  30765MiB / 32510MiB |    100%   E. Process |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-PCIE...  On   | 00000000:60:00.0 Off |                    0 |
    | N/A   38C    P0    40W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-PCIE...  On   | 00000000:61:00.0 Off |                    0 |
    | N/A   40C    P0    41W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-PCIE...  On   | 00000000:86:00.0 Off |                    0 |
    | N/A   52C    P0    43W / 250W |  30765MiB / 32510MiB |    100%   E. Process |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-PCIE...  On   | 00000000:DA:00.0 Off |                    0 |
    | N/A   42C    P0    43W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-PCIE...  On   | 00000000:DB:00.0 Off |                    0 |
    | N/A   45C    P0    43W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    

    We tune the number of socket threads and sockets per thread via NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD; the issue occurs most often when they are set to 16 and 2, respectively.

    opened by MrAta 1
Releases(v1.3.4-1)
Owner
NVIDIA Corporation