HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs)

Overview

Merlin: HugeCTR

v30

HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs). HugeCTR supports model-parallel embedding tables and data-parallel neural networks and their variants such as Deep Interest Network (DIN), Neural Collaborative Filtering (NCF), Wide and Deep Learning (WDL), Deep Cross Network (DCN), DeepFM, and Deep Learning Recommendation Model (DLRM). HugeCTR is a component of NVIDIA Merlin Open Beta, which is used to build large-scale deep learning recommender systems. For more information, refer to the HugeCTR User Guide.

Design Goals:

  • Fast: HugeCTR is a speed-of-light CTR model framework that can outperform popular frameworks such as TensorFlow (TF) when training CTR models.
  • Efficient: HugeCTR provides the essentials so that you can efficiently train your CTR model.
  • Easy: Regardless of whether you are a data scientist or machine learning practitioner, we've made it easy for anybody to use HugeCTR.

Core Features

HugeCTR supports a variety of features. To learn about our latest enhancements, refer to our release notes.

Getting Started

If you'd like to quickly train a model using the Python interface, do the following:

  1. Start an NGC container with your local host directory (/your/host/dir) mounted by running the following command:

    docker run --gpus=all --rm -it --cap-add SYS_NICE -v /your/host/dir:/your/container/dir -w /your/container/dir -u $(id -u):$(id -g) nvcr.io/nvidia/merlin/merlin-training:22.01
    

    NOTE: The /your/host/dir directory on the host is visible inside the container as /your/container/dir, which is also your starting directory.

  2. Write a simple Python script to generate a synthetic dataset:

    # dcn_norm_generate.py
    import hugectr
    from hugectr.tools import DataGeneratorParams, DataGenerator
    data_generator_params = DataGeneratorParams(
      format = hugectr.DataReaderType_t.Norm,
      label_dim = 1,
      dense_dim = 13,
      num_slot = 26,
      i64_input_key = False,
      source = "./dcn_norm/file_list.txt",
      eval_source = "./dcn_norm/file_list_test.txt",
      slot_size_array = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543,
                         39884, 39043, 17289, 7420, 20263, 3, 7120, 1543,
                         63, 63,
                         39884, 39043, 17289, 7420, 20263, 3, 7120, 1543],
      check_type = hugectr.Check_t.Sum,
      dist_type = hugectr.Distribution_t.PowerLaw,
      power_law_type = hugectr.PowerLaw_t.Short)
    data_generator = DataGenerator(data_generator_params)
    data_generator.generate()
    
  3. Generate the Norm dataset for your DCN model by running the following command:

    python dcn_norm_generate.py
    

    NOTE: The generated dataset will reside in the folder ./dcn_norm, which contains training and evaluation data.

  4. Write a simple Python script for training:

    # dcn_norm_train.py
    import hugectr
    from mpi4py import MPI
    solver = hugectr.CreateSolver(max_eval_batches = 1280,
                                  batchsize_eval = 1024,
                                  batchsize = 1024,
                                  lr = 0.001,
                                  vvgpu = [[0]],
                                  repeat_dataset = True)
    reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                                     source = ["./dcn_norm/file_list.txt"],
                                     eval_source = "./dcn_norm/file_list_test.txt",
                                     check_type = hugectr.Check_t.Sum)
    optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam,
                                        update_type = hugectr.Update_t.Global)
    model = hugectr.Model(solver, reader, optimizer)
    model.add(hugectr.Input(label_dim = 1, label_name = "label",
                            dense_dim = 13, dense_name = "dense",
                            data_reader_sparse_param_array = 
                            [hugectr.DataReaderSparseParam("data1", 1, True, 26)]))
    model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                               workspace_size_per_gpu_in_mb = 25,
                               embedding_vec_size = 16,
                               combiner = "sum",
                               sparse_embedding_name = "sparse_embedding1",
                               bottom_name = "data1",
                               optimizer = optimizer))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                               bottom_names = ["sparse_embedding1"],
                               top_names = ["reshape1"],
                               leading_dim=416))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                               bottom_names = ["reshape1", "dense"], top_names = ["concat1"]))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.MultiCross,
                               bottom_names = ["concat1"],
                               top_names = ["multicross1"],
                               num_layers=6))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                               bottom_names = ["concat1"],
                               top_names = ["fc1"],
                               num_output=1024))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                               bottom_names = ["fc1"],
                               top_names = ["relu1"]))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                               bottom_names = ["relu1"],
                               top_names = ["dropout1"],
                               dropout_rate=0.5))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                               bottom_names = ["dropout1", "multicross1"],
                               top_names = ["concat2"]))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                               bottom_names = ["concat2"],
                               top_names = ["fc2"],
                               num_output=1))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                               bottom_names = ["fc2", "label"],
                               top_names = ["loss"]))
    model.compile()
    model.summary()
    model.graph_to_json(graph_config_file = "dcn.json")
    model.fit(max_iter = 5120, display = 200, eval_interval = 1000, snapshot = 5000, snapshot_prefix = "dcn")
    

    NOTE: Ensure that the paths to the synthetic datasets are correct with respect to this Python script. data_reader_type, check_type, label_dim, dense_dim, and data_reader_sparse_param_array should be consistent with the generated dataset.

  5. Train the model by running the following command:

    python dcn_norm_train.py
    

    NOTE: Because the dataset is randomly generated, do not expect a meaningful evaluation AUC value. When training finishes, files containing the dumped graph JSON, the saved model weights, and the optimizer states are generated.
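
The NOTE in step 4 asks that the training script stay consistent with the generated dataset. The following is a minimal sketch, in plain Python and independent of HugeCTR, of the arithmetic behind the values used above. It assumes that the cross layers keep the width of their input and that each embedding weight is a 4-byte float; both are labeled as assumptions in the code. The helper name dcn_widths is hypothetical and is not part of the HugeCTR API.

```python
# Values copied from dcn_norm_generate.py and dcn_norm_train.py above.
slot_size_array = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543,
                   39884, 39043, 17289, 7420, 20263, 3, 7120, 1543,
                   63, 63,
                   39884, 39043, 17289, 7420, 20263, 3, 7120, 1543]
num_slot = 26
dense_dim = 13
embedding_vec_size = 16
fc1_units = 1024  # num_output of the first InnerProduct layer


def dcn_widths(num_slot, dense_dim, embedding_vec_size, fc1_units):
    """Return the per-sample feature width of the named tensors in the script."""
    reshape1 = num_slot * embedding_vec_size  # leading_dim of the Reshape layer
    concat1 = reshape1 + dense_dim            # embeddings plus dense features
    multicross1 = concat1                     # assumption: cross layers keep the input width
    concat2 = fc1_units + multicross1         # dropout1 (same width as fc1) plus multicross1
    return {
        "reshape1": reshape1,
        "concat1": concat1,
        "multicross1": multicross1,
        "concat2": concat2,
    }


widths = dcn_widths(num_slot, dense_dim, embedding_vec_size, fc1_units)

# One slot size per slot, otherwise generator and reader disagree.
assert len(slot_size_array) == num_slot

# Weights-only footprint of the embedding table.
# Assumption: 4-byte float weights; optimizer state is not counted here.
total_keys = sum(slot_size_array)
weights_mib = total_keys * embedding_vec_size * 4 / 2**20

print(widths)
print(total_keys, round(weights_mib, 2))
```

With the values from the scripts, reshape1 is 26 × 16 = 416, which matches leading_dim=416 in the Reshape layer. If you change num_slot or embedding_vec_size, leading_dim has to change with them. The weights-only figure is only a rough lower bound when choosing workspace_size_per_gpu_in_mb; how much extra room the optimizer state needs is not covered by this sketch.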

For more information, refer to the HugeCTR User Guide.
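
Before running the training script in step 5, it can help to confirm that the file lists written in step 3 point at files that exist. The sketch below assumes the Norm file list layout in which the first line holds the number of data files and each following line holds one path, resolved relative to the directory you run from. The function names read_file_list and missing_data_files are hypothetical helpers, not HugeCTR functions.

```python
from pathlib import Path


def read_file_list(file_list_path):
    """Parse a file list: first line is the file count, then one path per line."""
    text = Path(file_list_path).read_text()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError(f"{file_list_path} is empty")
    declared = int(lines[0])
    paths = lines[1:]
    if declared != len(paths):
        raise ValueError(
            f"{file_list_path} declares {declared} files but lists {len(paths)}"
        )
    return paths


def missing_data_files(file_list_path):
    """Return the listed data files that do not exist on disk."""
    return [p for p in read_file_list(file_list_path) if not Path(p).exists()]


if __name__ == "__main__":
    # Paths taken from the source and eval_source arguments used above.
    for name in ("./dcn_norm/file_list.txt", "./dcn_norm/file_list_test.txt"):
        if Path(name).exists():
            print(name, "missing files:", missing_data_files(name))
        else:
            print(name, "not found; run dcn_norm_generate.py first")
```

Run the check from the same directory as dcn_norm_train.py, since the source and eval_source paths in that script are relative.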

HugeCTR SDK

To support external developers who can't use HugeCTR directly, we export important HugeCTR components as the following:

  • Sparse Operation Kit: a Python package that wraps GPU-accelerated operations dedicated to sparse training and inference.
  • GPU Embedding Cache: an embedding cache in GPU memory designed for CTR inference workloads.

Support and Feedback

If you encounter any issues or have questions, go to https://github.com/NVIDIA/HugeCTR/issues and submit an issue so that we can provide you with the necessary resolutions and answers. To further advance the HugeCTR Roadmap, we encourage you to share all the details regarding your recommender system pipeline using this survey.

Contributing to HugeCTR

With HugeCTR being an open source project, we welcome contributions from the general public. With your contributions, we can continue to improve HugeCTR's quality and performance. To learn how to contribute, refer to our HugeCTR Contributor Guide.

Additional Resources

Webpages
NVIDIA Merlin
NVIDIA HugeCTR

Talks

Conference / Website | Title | Date | Speaker | Language
APSARA 2021 | Merlin GPU Recommender System | Oct 2021 | Joey Wang | Chinese
GTC Spring 2021 | Learn how Tencent Deployed an Advertising System on the Merlin GPU Recommender Framework | April 2021 | Xiangting Kong, Joey Wang | English
GTC Spring 2021 | Merlin HugeCTR: Deep Dive Into Performance Optimization | April 2021 | Minseok Lee | English
GTC Spring 2021 | Integrate HugeCTR Embedding with TensorFlow | April 2021 | Jianbing Dong | English
GTC China 2020 | Merlin HugeCTR: Deep Dive Into Performance Optimization | Oct 2020 | Minseok Lee | English
GTC China 2020 | Deploying a High-Performance GPU-Accelerated Advertising Recommender System with a 7x+ Performance Gain | Oct 2020 | Xiangting Kong | Chinese
GTC China 2020 | Accelerating CTR Inference with the GPU Embedding Cache | Oct 2020 | Fan Yu | Chinese
GTC China 2020 | Integrating HugeCTR Embedding with TensorFlow | Oct 2020 | Jianbing Dong | Chinese
GTC Spring 2020 | HugeCTR: High-Performance Click-Through Rate Estimation Training | March 2020 | Minseok Lee, Joey Wang | English
GTC China 2019 | HugeCTR: GPU-Accelerated Recommender System Training | Oct 2019 | Joey Wang | Chinese

Blogs

Conference / Website | Title | Date | Authors | Language
NVIDIA Devblog | Accelerating Embedding with the HugeCTR TensorFlow Embedding Plugin | Sept 2021 | Vinh Nguyen, Ann Spencer, Joey Wang and Jianbing Dong | English
medium.com | Optimizing Meituan’s Machine Learning Platform: An Interview with Jun Huang | Sept 2021 | Sheng Luo and Benedikt Schifferer | English
medium.com | Leading Design and Development of the Advertising Recommender System at Tencent: An Interview with Xiangting Kong | Sept 2021 | Xiangting Kong, Ann Spencer | English
NVIDIA Devblog | Scaling and Accelerating Large Deep Learning Recommender Systems - HugeCTR Series Part 1 | June 2021 | Minseok Lee | Chinese
NVIDIA Devblog | Training Large Deep Learning Recommender Models with Merlin HugeCTR's Python APIs - HugeCTR Series Part 2 | June 2021 | Vinh Nguyen | Chinese
medium.com | Training large Deep Learning Recommender Models with Merlin HugeCTR’s Python APIs - HugeCTR Series Part 2 | May 2021 | Minseok Lee, Joey Wang, Vinh Nguyen and Ashish Sardana | English
medium.com | Scaling and Accelerating large Deep Learning Recommender Systems - HugeCTR Series Part 1 | May 2021 | Minseok Lee | English
IRS 2020 | Merlin: A GPU Accelerated Recommendation Framework | Aug 2020 | Even Oldridge et al. | English
NVIDIA Devblog | Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems | July 2020 | Minseok Lee and Joey Wang | English
Comments
  • [BUG] SparseOperationKit hangs on initialization

    To Reproduce Steps to reproduce the behavior:

    1. Build image based on gcr.io/deeplearning-platform-release/tf2-gpu.2-5 and install SOK
    FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5
    
    COPY ./hugectr /usr/src/app/
    WORKDIR /usr/src/app/sparse_operation_kit
    # pythonpath setting in the ./install.sh fails, so we just exit even if we fail
    RUN ./install.sh --SM="70;75;80" --USE_NVTX=OFF; exit 0
    ENV PYTHONPATH "/usr/local/lib/:${PYTHONPATH}"
    WORKDIR /usr/src/app
    
    2. Run this code:
    import os
    from absl import flags, app, logging
    import tensorflow as tf
    
    import numpy as np
    
    import sparse_operation_kit as sok
    
    flags.DEFINE_integer('num_items', 1024, 'Number of items in embedding.')
    
    FLAGS = flags.FLAGS
    
    batch_size = 4096
    
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, range(16)))
    
    def gen():
      while True:
        users = tf.random.uniform([batch_size], 0, FLAGS.num_items, tf.int32)
        items = tf.random.uniform([batch_size], 0, FLAGS.num_items, tf.int32)
        yield users, items, tf.random.normal([batch_size])
    
    
    class Model(tf.keras.Model):
    
      def __init__(self):
        super().__init__()
    
        self._embeddings = sok.DistributedEmbedding(
          combiner='mean',
          max_vocabulary_size_per_gpu=1024,
          embedding_vec_size=256,
          slot_num=2,
          max_nnz=2,
        )
    
      def call(self, inputs, training=False, mask=None):
        # Whatever the lookup is.
        logging.info(f'user: {inputs[0].shape}, {inputs[0].device}')
        return self._embeddings(tf.concat([inputs[0], inputs[1]], axis=1))
    
    
    def main(_):
      gpus = tf.config.list_physical_devices('GPU')
      logging.info(f'Found {gpus}')
      for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    
      # strategy = tf.distribute.MirroredStrategy(['gpu:0'], cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
      # strategy = tf.distribute.MirroredStrategy(['gpu:0'])
      # strategy = tf.distribute.MirroredStrategy()
      strategy = tf.distribute.MirroredStrategy(['gpu:0'], cross_device_ops=tf.distribute.NcclAllReduce())
    
      ds = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32, tf.float32))
      ds = ds.prefetch(10)
    
      with strategy.scope():
        logging.info(f'Initializing sok.')
        result = sok.Init(global_batch_size=1024)
        model = Model()
        emb_opt = tf.keras.optimizers.SGD(0.001)
        dense_opt = tf.keras.optimizers.SGD(0.001)
    
      # more code that is never reached
     
    if __name__ == '__main__':
      app.run(main)
    
    3. Run this on an A100 machine with 16 GPUs.

    Expected behavior: Execution should proceed past sok.Init, but it does not. I get the usual TensorFlow initialization and GPU discovery logs, and then I hit "Initializing sok." It does not proceed beyond the logs shown here.

    I1109 10:07:18.764635 139666590381888 model.py:56] Initializing sok.
    2021-11-09 10:07:20.041804: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
    2021-11-09 10:07:20.042466: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2200210000 Hz
    You are using the plugin with MirroredStrategy.
    hugectr-chief-0:1:1 [0] NCCL INFO Bootstrap : Using eth0:7.12.81.17<0>
    hugectr-chief-0:1:1 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
    hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Tx CPU start: -2
    hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Rx CPU start: -2
    hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Flow placement enabled.
    hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : queue skip: 0
    hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket : Using [0]eth0:7.12.81.17<0>
    hugectr-chief-0:1:1 [0] NCCL INFO NET/FastSocket plugin initialized
    hugectr-chief-0:1:1 [0] NCCL INFO Using network FastSocket
    2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:81] Global seed is 314905248
    2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:82] Local GPU Count: 16
    2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:83] Global GPU Count: 1
    2021-11-09 10:07:20.452440: I sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/manager.cc:97] Global Replica Id: 0; Local Replica Id: 0
    NCCL version 2.10.3+cuda11.0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 00/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 01/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 02/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 03/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 04/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 05/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 06/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 07/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 08/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 09/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 10/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 11/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 12/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 13/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 14/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 15/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 16/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 17/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 18/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 19/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 20/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 21/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 22/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 23/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 24/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 25/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 26/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 27/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 28/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 29/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 30/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Channel 31/32 :    0
    hugectr-chief-0:1:251 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
    hugectr-chief-0:1:251 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
    hugectr-chief-0:1:251 [0] NCCL INFO Connected all rings
    hugectr-chief-0:1:251 [0] NCCL INFO Connected all trees
    hugectr-chief-0:1:251 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
    hugectr-chief-0:1:251 [0] NCCL INFO comm 0x7efdbc00da30 rank 0 nranks 1 cudaDev 0 busId 40 - Init COMPLETE
    2021-11-09 10:07:20.979918: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
    

    None of the strategy instantiations shown above proceed beyond this point:

    • all GPUs vs single GPUs
    • hierarchical vs nccl

    Environment (please complete the following information):

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
    | N/A   34C    P0    58W / 400W |    714MiB / 40537MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    

    with 16 GPUs.

    I also made sure to give the container a /dev/shm of 16Gi.

    bug P0 
    opened by rllin 53
  • [BUG] Installing of sparse_operation_kit from pip failed

    To Reproduce Steps to reproduce the behavior:

    1. Use the AMI https://aws.amazon.com/releasenotes/deep-learning-ami-gpu-tensorflow-2-9-ubuntu-20-04/ to spin up an EC2 cluster with instance type g4dn.xxlarge
    2. Run pip3.9 install sparse_operation_kit

    Logs

    Defaulting to user installation because normal site-packages is not writeable
    Collecting sparse_operation_kit
      Using cached sparse_operation_kit-1.1.2-py3-none-any.whl
    Collecting merlin-sok
      Using cached merlin-sok-1.1.3.tar.gz (152 kB)
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Preparing metadata (pyproject.toml) ... done
    Building wheels for collected packages: merlin-sok
      Building wheel for merlin-sok (pyproject.toml) ... error
      error: subprocess-exited-with-error
      
      × Building wheel for merlin-sok (pyproject.toml) did not run successfully.
      │ exit code: 1
      ╰─> [130 lines of output]
          running bdist_wheel
          running build
          running build_py
          creating build
          creating build/lib.linux-x86_64-3.9
          creating build/lib.linux-x86_64-3.9/sparse_operation_kit
          copying ./sparse_operation_kit/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit
          copying ./sparse_operation_kit/kit_lib.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit
          creating build/lib.linux-x86_64-3.9/sparse_operation_kit/core
          copying ./sparse_operation_kit/core/_version.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
          copying ./sparse_operation_kit/core/embedding_layer_handle.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
          copying ./sparse_operation_kit/core/context_scope.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
          copying ./sparse_operation_kit/core/embedding_variable_v2.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
          copying ./sparse_operation_kit/core/inplace_initializer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
          copying ./sparse_operation_kit/core/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
          copying ./sparse_operation_kit/core/embedding_variable_v1.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
          copying ./sparse_operation_kit/core/initialize.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
          copying ./sparse_operation_kit/core/graph_keys.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
          creating build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
          copying ./sparse_operation_kit/operations/compat_ops_lib.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
          copying ./sparse_operation_kit/operations/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
          creating build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
          copying ./sparse_operation_kit/saver/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
          copying ./sparse_operation_kit/saver/Saver.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
          creating build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
          copying ./sparse_operation_kit/embeddings/all2all_dense_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
          copying ./sparse_operation_kit/embeddings/embedding_ops.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
          copying ./sparse_operation_kit/embeddings/tf_distributed_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
          copying ./sparse_operation_kit/embeddings/distributed_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
          copying ./sparse_operation_kit/embeddings/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
          copying ./sparse_operation_kit/embeddings/get_embedding_op.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
          creating build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
          copying ./sparse_operation_kit/optimizers/optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
          copying ./sparse_operation_kit/optimizers/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
          copying ./sparse_operation_kit/optimizers/utils.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
          copying ./sparse_operation_kit/optimizers/adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
          copying ./sparse_operation_kit/optimizers/base_optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
          creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf
          copying ./sparse_operation_kit/tf/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf
          creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras
          copying ./sparse_operation_kit/tf/keras/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras
          creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
          copying ./sparse_operation_kit/tf/keras/mixed_precision/loss_scale_optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
          copying ./sparse_operation_kit/tf/keras/mixed_precision/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
          creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
          copying ./sparse_operation_kit/tf/keras/optimizers/common.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
          copying ./sparse_operation_kit/tf/keras/optimizers/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
          copying ./sparse_operation_kit/tf/keras/optimizers/lazy_adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
          copying ./sparse_operation_kit/tf/keras/optimizers/adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
          running build_ext
          -- The CXX compiler identification is GNU 9.4.0
          -- The CUDA compiler identification is NVIDIA 11.2.152
          -- Detecting CXX compiler ABI info
          -- Detecting CXX compiler ABI info - done
          -- Check for working CXX compiler: /usr/bin/c++ - skipped
          -- Detecting CXX compile features
          -- Detecting CXX compile features - done
          -- Detecting CUDA compiler ABI info
          -- Detecting CUDA compiler ABI info - done
          -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
          -- Detecting CUDA compile features
          -- Detecting CUDA compile features - done
          -- Building Sparse Operation Kit from source.
          -- Looking for C++ include pthread.h
          -- Looking for C++ include pthread.h - found
          -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
          -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
          -- Looking for pthread_create in pthreads
          -- Looking for pthread_create in pthreads - not found
          -- Looking for pthread_create in pthread
          -- Looking for pthread_create in pthread - found
          -- Found Threads: TRUE
          -- Found CUDA: /usr/local/cuda (found version "11.2")
          CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
            Could NOT find NCCL (missing: NCCL_INCLUDE_DIR NCCL_LIBRARIES)
          Call Stack (most recent call first):
            /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
            cmakes/FindNCCL.cmake:36 (find_package_handle_standard_args)
            CMakeLists.txt:25 (find_package)
          
          
          -- Configuring incomplete, errors occurred!
          See also "/tmp/pip-install-w5u7if11/merlin-sok_38d3b0c9055e49fba249838a52d59899/build/lib.linux-x86_64-3.9/CMakeFiles/CMakeOutput.log".
          See also "/tmp/pip-install-w5u7if11/merlin-sok_38d3b0c9055e49fba249838a52d59899/build/lib.linux-x86_64-3.9/CMakeFiles/CMakeError.log".
          Traceback (most recent call last):
            File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
              main()
            File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
              json_out['return_val'] = hook(**hook_input['kwargs'])
            File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 261, in build_wheel
              return _build_backend().build_wheel(wheel_directory, config_settings,
            File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 230, in build_wheel
              return self._build_with_temp_dir(['bdist_wheel'], '.whl',
            File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 215, in _build_with_temp_dir
              self.run_setup()
            File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 158, in run_setup
              exec(compile(code, __file__, 'exec'), locals())
            File "setup.py", line 182, in <module>
              setup(
            File "/usr/local/lib/python3.9/dist-packages/setuptools/__init__.py", line 153, in setup
              return distutils.core.setup(**attrs)
            File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/core.py", line 148, in setup
              return run_commands(dist)
            File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/core.py", line 163, in run_commands
              dist.run_commands()
            File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 967, in run_commands
              self.run_command(cmd)
            File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
              cmd_obj.run()
            File "/tmp/pip-build-env-et8cy105/overlay/lib/python3.9/site-packages/wheel/bdist_wheel.py", line 299, in run
              self.run_command('build')
            File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/cmd.py", line 313, in run_command
              self.distribution.run_command(command)
            File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
              cmd_obj.run()
            File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/command/build.py", line 135, in run
              self.run_command(cmd_name)
            File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/cmd.py", line 313, in run_command
              self.distribution.run_command(command)
            File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
              cmd_obj.run()
            File "/usr/local/lib/python3.9/dist-packages/setuptools/command/build_ext.py", line 79, in run
              _build_ext.run(self)
            File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
              self.build_extensions()
            File "setup.py", line 102, in build_extensions
              subprocess.check_call("cmake {} {} && make -j{}".format(cmake_args,
            File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
              raise CalledProcessError(retcode, cmd)
          subprocess.CalledProcessError: Command 'cmake -DSM='70;75;80' -DUSE_NVTX=OFF -DSOK_ASYNC=ON -DSOK_UNIT_TEST=OFF -DCMAKE_BUILD_TYPE=Release /tmp/pip-install-w5u7if11/merlin-sok_38d3b0c9055e49fba249838a52d59899 && make -j$(nproc)' returned non-zero exit status 1.
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for merlin-sok
    Failed to build merlin-sok
    ERROR: Could not build wheels for merlin-sok, which is required to install pyproject.toml-based projects
    

    Environment (please complete the following information):

    • OS: Ubuntu 20.04
    • AMI: https://aws.amazon.com/releasenotes/deep-learning-ami-gpu-tensorflow-2-9-ubuntu-20-04/
    • CUDA version: 11.2

    Additional context I already tried setting NCCL_INCLUDE_DIR=/usr/local/cuda-11.2/include/ and NCCL_LIBRARIES=/usr/local/cuda-11.2/lib/ as environment variables, but I still get the same error.
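
Before retrying the build, it can help to confirm that the NCCL header and shared library actually exist under the directories being passed in. The following is a minimal sketch using only the Python standard library. The helper name `find_nccl` and the candidate directory lists are assumptions made for illustration; they are not part of the merlin-sok build.

```python
import glob
import os


def find_nccl(include_dirs, library_dirs):
    """Return (header_paths, library_paths) found in the given directories."""
    headers = [
        os.path.join(d, "nccl.h")
        for d in include_dirs
        if os.path.isfile(os.path.join(d, "nccl.h"))
    ]
    libraries = []
    for d in library_dirs:
        # Matches libnccl.so as well as versioned names such as libnccl.so.2.
        libraries.extend(sorted(glob.glob(os.path.join(d, "libnccl.so*"))))
    return headers, libraries


if __name__ == "__main__":
    # Hypothetical search locations; adjust them to the machine being debugged.
    include_dirs = ["/usr/local/cuda-11.2/include", "/usr/include"]
    library_dirs = [
        "/usr/local/cuda-11.2/lib",
        "/usr/local/cuda-11.2/lib64",
        "/usr/lib/x86_64-linux-gnu",
    ]
    headers, libraries = find_nccl(include_dirs, library_dirs)
    print("nccl.h found at:", headers or "none of the listed directories")
    print("libnccl found at:", libraries or "none of the listed directories")
```

If both lists come back empty, the environment variables point at directories that do not contain NCCL, which would be consistent with the error being unchanged.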

    opened by silpara 17
  • [Question] Error with docker

    After building the Docker image using https://github.com/NVIDIA-Merlin/Merlin/blob/main/docker/training/dockerfile.ctr and running cmake and make install for HugeCTR, I can't import HugeCTR in the container. What can I do to solve this problem?

    question 
    opened by politefish 12
  • [BUG] Unable to use MirroredStrategy with Sparse Operation Kit

    Describe the bug

    I'm trying to initialize a model in which a tf.keras.Input layer is fed directly to an instance of sok.embeddings.all2all_dense_embedding.All2AllDenseEmbedding while using TensorFlow's MirroredStrategy. When I try to call the embedding layer, however, I get the following error:

        embeddings = embedding(inputs)
      File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 976, in __call__
        return self._functional_construction_call(inputs, args, kwargs,
      File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 1114, in _functional_construction_call
        outputs = self._keras_tensor_symbolic_call(
      File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 848, in _keras_tensor_symbolic_call
        return self._infer_output_signature(inputs, args, kwargs, input_masks)
      File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 888, in _infer_output_signature
        outputs = call_fn(inputs, *args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 695, in wrapper
        raise e.ag_error_metadata.to_exception(e)
    AttributeError: in user code:
    
        /usr/local/lib/python3.8/dist-packages/SparseOperationKit-1.1.0-py3.8-linux-x86_64.egg/sparse_operation_kit/embeddings/all2all_dense_embedding.py:132 call  *
            emb_vector = embedding_ops.embedding_lookup(embedding_variable=self.var,
        /usr/local/lib/python3.8/dist-packages/SparseOperationKit-1.1.0-py3.8-linux-x86_64.egg/sparse_operation_kit/embeddings/embedding_ops.py:86 embedding_lookup  *
            embedding_layer = embedding_variable.embedding_layer
        /usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/values.py:280 __getattr__
            return getattr(self._get(), name)
        /usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/values.py:741 _get
            return self._get_cross_replica()
        /usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/values.py:909 _get_cross_replica
            return self._policy._get_cross_replica(self)  # pylint: disable=protected-access
    
        AttributeError: 'VariableSynchronization' object has no attribute '_get_cross_replica'
    

    I initialize the layer with the arguments max_vocabulary_size_per_gpu=int(.75 * global_vocabulary_size), slot_num=1 and nnz_per_slot=1.

    Is there anything else I need to do to call this layer while creating my tf model?

    To Reproduce/Environment This is on a GCP VM with 2 V100s. I've tried to run this both in the nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.01 and nvcr.io/nvidia/merlin/merlin-tensorflow-training:21.11 docker images. I am installing my own packages into the container, but nothing that overwrites the already installed version of tensorflow and tensorflow-related packages.

    An example command I use to spin up the image: docker run --rm --entrypoint bash -it -v $(pwd):$(pwd) --gpus=all --ipc=host nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.01

    bug 
    opened by drubinstein 11
  • [BUG] Unable to run multi-node

    Describe the bug I followed the instructions provided in https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training and set up the environment exactly as suggested, including building HugeCTR separately with MULTI_NODE_ENABLED. However, when I try to run it using run_multinode.sh, I receive the following error:

    ORTE was unable to reliably start one or more daemons.
    This usually is caused by:
    
    * not finding the required libraries and/or binaries on
      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
      settings, or configure OMPI with --enable-orterun-prefix-by-default
    
    * lack of authority to execute on one or more specified nodes.
      Please verify your allocation and authorities.
    
    * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
      Please check with your sys admin to determine the correct location to use.
    
    *  compilation of the orted with dynamic libraries when static are required
      (e.g., on Cray). Please check your configure cmd line and consider using
      one of the contrib/platform definitions for your system type.
    
    * an inability to create a connection back to mpirun due to a
      lack of common network interfaces and/or no route found between
      them. Please check network connectivity (including firewalls
      and network routing requirements).
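
The first cause in the message above (missing binaries or libraries on a node) can be checked quickly on each node. The sketch below uses only the Python standard library. The function name `check_mpi_binaries` is hypothetical, and the check covers only the PATH-related cause, not permissions or networking.

```python
import os
import shutil


def check_mpi_binaries(binaries=("mpirun", "orted")):
    """Map each Open MPI binary name to its resolved path, or None if missing."""
    return {name: shutil.which(name) for name in binaries}


if __name__ == "__main__":
    for name, path in check_mpi_binaries().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
    # An empty value here means the dynamic loader uses only its default paths.
    print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "") or "(empty)")
```

Because the message refers to "one or more nodes", the script would need to be run on every node listed in the host file, in the same kind of non-interactive shell that mpirun uses to start its daemons.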
    
    

    To Reproduce Steps to reproduce the behavior:

    1. Build the Docker container using the instructions provided here: https://nvidia-merlin.github.io/HugeCTR/master/hugectr_contributor_guide.html#how-to-start-your-development
    2. Configure the build directory in run_multinode.sh.
    3. Run bash run_multinode.sh.

    Expected behavior Successful execution of the script.

    Environment (please complete the following information):

    • OS: Ubuntu 18.04
    • Graphic card: Nvidia P100
    • CUDA version: CUDA 11.2
    • Docker image - Followed the docker file provided here https://github.com/NVIDIA-Merlin/Merlin/blob/main/docker/training/dockerfile.ctr
    bug P2 fea::user experience 
    opened by iidsample 9
  • [Question] Some questions in running DLRM sample

    I am trying the DLRM sample in the main branch, where I face a few problems.

    1. There is no Docker image "nvcr.io/nvidia/merlin/merlin-training:22.01" as listed in the README.md. To get around this, I use the latest tag, i.e., 21.12 (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-training/tags), instead.
    2. When I run the command dlrm_raw ./ ./ for the Kaggle dataset, it raises a segmentation fault. dmesg shows the message: dlrm_raw[28741]: segfault at 8 ip 00007f8efc69daaf sp 00007fff4b869960 error 4 in libcudf.so[7f8efbc6f000+1ac0000]
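
The kernel message in item 2 can be decoded to narrow down where the crash happened. The sketch below parses the line and computes the faulting instruction's offset inside libcudf.so (instruction pointer minus the mapping's base address). The error-code interpretation assumes the standard x86 page-fault bits, and the function name `decode_segfault` is hypothetical.

```python
import re

DMESG_LINE = (
    "dlrm_raw[28741]: segfault at 8 ip 00007f8efc69daaf sp 00007fff4b869960 "
    "error 4 in libcudf.so[7f8efbc6f000+1ac0000]"
)

PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<lib>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Extract the faulting library, offset, and access type from a dmesg line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a recognised segfault line")
    err = int(match["err"])
    return {
        "library": match["lib"],
        "fault_address": int(match["addr"], 16),
        # Offset of the faulting instruction relative to the library mapping.
        "offset_in_library": int(match["ip"], 16) - int(match["base"], 16),
        # x86 page-fault error code bits (assumed architecture).
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    info = decode_segfault(DMESG_LINE)
    print(info["library"], hex(info["offset_in_library"]), info)
```

Under these assumptions, the line describes a user-mode read of an unmapped page at address 8, which is the typical signature of dereferencing a field at a small offset from a null pointer inside libcudf.so. If debug symbols are available, the computed offset can be passed to a tool such as addr2line.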

    The GPU info is attached.

    question 
    opened by HydraZeng 9
  • [BUG] TensorFlow hangs when using sparse_operation_kit with multiple embedding layers, possibly caused by gradient tape

    Describe the bug

    We use multiple embedding layers (because we have multiple embedding_vec_size values), and a variable may be shared because we want the distributed and all2all dense embeddings to use the same table. In the training loop, I carefully use tf.control_dependencies so that the NCCL operations are launched in order, but tf.GradientTape may launch NCCL operations in parallel during the backward propagation of the embedding layers. As a result, training hangs on V100 or A10 GPUs. However, the same model does not hang on two T4 GPUs.

    To Reproduce Steps to reproduce the behavior: I can give a code snippet in my github later.

    Expected behavior TensorFlow 2 must be used with a gradient tape, so how can we fix this?

    Environment (please complete the following information):

    • Docker image: tensorflow-merlin-0.6
    opened by marsmiao 9
  • [Question] How Can Incremental Training Be Achieved with a HugeCTR Model?

    NVIDIA's Developer Blog says that a HugeCTR model can be updated with incremental data, but there seems to be no documentation or sample code that explains how to do it.

    The closest script I have found is: https://github.com/NVIDIA/HugeCTR/blob/master/test/pybind_test/wdl_mos_high_level.py

    Am I on the right track?

    question 
    opened by tim5go 9
  • [Question] Kernel dies while creating Parameter server for HugeCTR Inference

    I followed this HugeCTR inference notebook to write inference code with a DLRM config file. Now, every time I try to create the parameter server object, the kernel dies without raising any exception.

    parameter_server= CreateParameterServer(['./config_files/dlrm_inference.json'], ['DLRM'], False)

    I am attaching the dlrm_inference.json file for reference:

    {
      "inference": {
        "max_batchsize": 1024,
        "dense_model_file": "./hugeCTR_saved_model_DLRM/_dense_18000.model",
        "sparse_model_file": "./hugeCTR_saved_model_DLRM/0_sparse_18000.model"
      },

      "layers": [
        {
          "name": "data",
          "type": "Data",
          "slot_size_array": [49865, 54893, 337691, 9523, 164],
          "slot_size_array_orig": [49865, 54893, 337691, 9523, 164],
          "source": "./hugeCTR/filelist.txt",
          "eval_source": "./hugeCTR/valid_filelist.txt",
          "check": "None",
          "cache_eval_data": true,
          "label": {
            "top": "label",
            "label_dim": 1
          },
          "dense": {
            "top": "dense",
            "dense_dim": 2
          },
          "sparse": [
            {
              "top": "data1",
              "type": "LocalizedSlot",
              "max_feature_num_per_sample": 5,
              "max_nnz": 1,
              "slot_num": 5
            }
          ]
        },

        {
          "name": "sparse_embedding1",
          "type": "LocalizedSlotSparseEmbeddingHash",
          "bottom": "data1",
          "top": "sparse_embedding1",
          "sparse_embedding_hparam": {
            "slot_size_array": [49865, 54893, 337691, 9523, 164],
            "embedding_vec_size": 64,
            "combiner": 0
          }
        },

        {
          "name": "fc1",
          "type": "FusedInnerProduct",
          "bottom": "dense",
          "top": "fc1",
          "fc_param": {
            "num_output": 64
          }
        },

        {
          "name": "fc2",
          "type": "FusedInnerProduct",
          "bottom": "fc1",
          "top": "fc2",
          "fc_param": {
            "num_output": 128
          }
        },

        {
          "name": "fc3",
          "type": "FusedInnerProduct",
          "bottom": "fc2",
          "top": "fc3",
          "fc_param": {
            "num_output": 64
          }
        },

        {
          "name": "interaction1",
          "type": "Interaction",
          "bottom": ["fc3", "sparse_embedding1"],
          "top": "interaction1"
        },

        {
          "name": "fc4",
          "type": "FusedInnerProduct",
          "bottom": "interaction1",
          "top": "fc4",
          "fc_param": {
            "num_output": 1024
          }
        },

        {
          "name": "fc5",
          "type": "FusedInnerProduct",
          "bottom": "fc4",
          "top": "fc5",
          "fc_param": {
            "num_output": 1024
          }
        },

        {
          "name": "fc6",
          "type": "FusedInnerProduct",
          "bottom": "fc5",
          "top": "fc6",
          "fc_param": {
            "num_output": 512
          }
        },

        {
          "name": "fc7",
          "type": "FusedInnerProduct",
          "bottom": "fc6",
          "top": "fc7",
          "fc_param": {
            "num_output": 256
          }
        },

        {
          "name": "fc8",
          "type": "InnerProduct",
          "bottom": "fc7",
          "top": "fc8",
          "fc_param": {
            "num_output": 1
          }
        },

        {
          "name": "sigmoid",
          "type": "Sigmoid",
          "bottom": "fc8",
          "top": "sigmoid"
        }
      ]
    }
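
When a process dies without an exception, one inexpensive step is to rule out wiring mistakes in the config before looking at the native code. The sketch below checks that every "bottom" in a config shaped like the one above refers to a "top" produced by an earlier layer. It is only an illustration: `find_unresolved_bottoms` and the trimmed `SAMPLE` config are hypothetical, and a clean result does not explain or rule out a crash inside the parameter server.

```python
import json

# Trimmed, hypothetical config with the same layout as the one in the post.
SAMPLE = """
{
  "layers": [
    {"name": "data", "type": "Data",
     "label": {"top": "label", "label_dim": 1},
     "dense": {"top": "dense", "dense_dim": 2},
     "sparse": [{"top": "data1", "type": "LocalizedSlot", "slot_num": 5}]},
    {"name": "sparse_embedding1", "type": "LocalizedSlotSparseEmbeddingHash",
     "bottom": "data1", "top": "sparse_embedding1"},
    {"name": "fc1", "type": "FusedInnerProduct", "bottom": "dense", "top": "fc1"},
    {"name": "interaction1", "type": "Interaction",
     "bottom": ["fc1", "sparse_embedding1"], "top": "interaction1"}
  ]
}
"""


def find_unresolved_bottoms(config):
    """Return (layer_name, bottom_name) pairs whose bottom was never produced."""
    produced = set()
    unresolved = []
    for layer in config["layers"]:
        if layer["type"] == "Data":
            # The data layer produces the label, dense, and sparse tensors.
            produced.add(layer["label"]["top"])
            produced.add(layer["dense"]["top"])
            produced.update(sparse["top"] for sparse in layer["sparse"])
            continue
        bottoms = layer["bottom"]
        if isinstance(bottoms, str):
            bottoms = [bottoms]
        unresolved.extend((layer["name"], b) for b in bottoms if b not in produced)
        tops = layer["top"]
        produced.update([tops] if isinstance(tops, str) else tops)
    return unresolved


if __name__ == "__main__":
    print(find_unresolved_bottoms(json.loads(SAMPLE)))
```

To check a real file, load it with `json.load` and pass the resulting dictionary to the same function.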
    
    
    question 
    opened by ashutoshk0 9
  • [BUG] Criteo 1TB dataset missing

    Describe the bug Is there another source for the per-day Criteo data?

    The links at https://labs.criteo.com/2013/12/download-terabyte-click-logs/ like http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_0.gz seem unavailable.

    A mirror would be great. I found the smaller dataset, but not the full 1 TB one.

    bug P0 
    opened by rllin 8
  • [Question] How to Weigh Certain Training Samples Differently Depending on the Label?

    Hello,

    I have both implicit and explicit labels that I'd like to weigh differently. Could you please provide guidance on how we could do that?
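
To make the request concrete, the following framework-independent sketch shows what per-sample weighting of a binary cross-entropy loss means. It is not a HugeCTR API. The function name and the example weights (1.0 for explicit labels, 0.2 for implicit ones) are made up for illustration.

```python
import math


def weighted_binary_cross_entropy(labels, probabilities, weights, eps=1e-7):
    """Weighted mean of the per-sample binary cross-entropy."""
    total = 0.0
    for y, p, w in zip(labels, probabilities, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += w * -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


if __name__ == "__main__":
    labels = [1, 1, 0]
    probabilities = [0.9, 0.6, 0.2]
    # Hypothetical weights: explicit labels count fully, implicit ones less.
    weights = [1.0, 0.2, 1.0]
    print(weighted_binary_cross_entropy(labels, probabilities, weights))
```

With all weights equal, this reduces to the ordinary mean binary cross-entropy, so the weights only change how much each sample contributes to the gradient.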

    I see that BinaryCrossEntropy only supports binary labels, but I also see that CrossEntropyLoss is implemented. However, I am a bit confused on how to use CrossEntropyLoss. In the document, it says that the input/output shapes are

    - input: [(batch_size, 2), (batch_size, 2)] where the first tensor represents the predictions while the second tensor represents the labels
    - output: (batch_size, 2)
    

    Why does the second dimension of the labels have to be 2?

    I see from here that CrossEntropyLoss is supposed to mimic tf.keras.losses.SparseCategoricalCrossentropy, but from my understanding, tf.keras.losses.SparseCategoricalCrossentropy's label tensor does not have to have a second dimension of 2. TF's webpage here says it can do

    y_true = [1, 2]
    y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
    # Using 'auto'/'sum_over_batch_size' reduction type.
    scce = tf.keras.losses.SparseCategoricalCrossentropy()
    scce(y_true, y_pred).numpy()
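
One possible reading of a label tensor with a second dimension equal to the number of classes is a one-hot encoding; this is an assumption, not something the HugeCTR documentation quoted here confirms. The pure-Python sketch below reproduces the TensorFlow example by hand and shows that sparse index labels and one-hot labels give the same loss. All function names are hypothetical.

```python
import math


def sparse_categorical_cross_entropy(class_indices, probabilities):
    """Mean of -log(p[true_class]) with integer class labels."""
    losses = [-math.log(p[i]) for i, p in zip(class_indices, probabilities)]
    return sum(losses) / len(losses)


def one_hot(index, num_classes):
    return [1.0 if k == index else 0.0 for k in range(num_classes)]


def categorical_cross_entropy(one_hot_labels, probabilities):
    """Mean of -sum(y * log(p)) with one-hot labels; zero-label terms are skipped."""
    losses = [
        -sum(y * math.log(p) for y, p in zip(ys, ps) if y > 0)
        for ys, ps in zip(one_hot_labels, probabilities)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    y_true = [1, 2]
    y_pred = [[0.05, 0.95, 0.0], [0.1, 0.8, 0.1]]
    sparse = sparse_categorical_cross_entropy(y_true, y_pred)
    dense = categorical_cross_entropy([one_hot(i, 3) for i in y_true], y_pred)
    print(round(sparse, 4), round(dense, 4))
```

Under that reading, a binary problem with one-hot labels would naturally have a label width of 2, while the sparse form stores only the class index.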
    

    Thank you for your help!

    question 
    opened by shoyasaxa 8
  • original error: libcuda.so.1: cannot open shared object file: No such file or directory, a problem occurred in the docker image nvcr.io/nvidia/tensorflow:22.06-tf2-py3

    Describe the bug I pulled the image in order to test SOK (the HugeCTR TensorFlow embedding plugin). When I execute the following Python command:

    python -c "import cupy as cp"
    

    the following error occurs:

    original error: libcuda.so.1: cannot open shared object file: No such file or directory
    

    Environment (please complete the following information):

    • Docker image: nvcr.io/nvidia/tensorflow:22.06-tf2-py3

    Solution

    echo $LD_LIBRARY_PATH
    # result: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
    

    The path "/usr/local/cuda/compat/lib" does not exist, while another path, "/usr/local/cuda/compat/lib.real", does exist and contains the file "libcuda.so.1". I therefore created a symbolic link to solve the problem:

    ln -s /usr/local/cuda/compat/lib.real /usr/local/cuda/compat/lib
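
The manual diagnosis above can be automated. The sketch below splits a colon-separated library search path and reports, for each entry, whether the directory exists and whether it contains the target library. It uses only the Python standard library, and the function name `inspect_library_path` is hypothetical.

```python
import os


def inspect_library_path(library_path, target="libcuda.so.1"):
    """Return (entry, directory_exists, contains_target) for each path entry."""
    report = []
    for entry in filter(None, library_path.split(":")):
        exists = os.path.isdir(entry)
        has_target = exists and os.path.exists(os.path.join(entry, target))
        report.append((entry, exists, has_target))
    return report


if __name__ == "__main__":
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for entry, exists, has_target in inspect_library_path(search_path):
        status = "missing directory" if not exists else (
            "contains libcuda.so.1" if has_target else "no libcuda.so.1"
        )
        print(f"{entry}: {status}")
```

In the situation described in the post, such a report would flag /usr/local/cuda/compat/lib as a missing directory, which points to the symbolic link as a workaround.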
    
    opened by shijiexu09 1
  • [Requirement] TLS communication for cloud-hosted HPS

    Description

    It would be great if TLS communication were added as an option to client initialization in redis_backend.cpp

    Adding a TLS option would secure communication between the HPS and HugeCTR when hosting the CPU cache component of the HPS separate (not the same server) from the triton server instance.

    Benefits

    Adding TLS communication would further a cloud-hosted HPS which has the following benefits:

    • Security: The only option for security today is basic ACL auth. TLS would improve this.
    • Cost: this would be specifically advantageous for users looking to utilize cheap CPU cloud instances with high memory for the embedding storage.
    • Infrastructure simplicity:
      • Many of the cloud-hosted versions of Redis (e.g., Redis Enterprise) are compatible with the existing Redis client and are very easy to set up, manage, and scale.
      • Expanded storage options: Some cloud offerings provide flash storage, which replaces the need to set up RocksDB per Triton server or on NFS, further decreasing the amount of setup needed to put HPS into production.

    Implementation

    Add parameters to the Redis client initialization to allow key and PEM files to be passed for TLS communication. The dependency used for Redis communication already supports this. See an example here.

    Keep in mind that this will require a build flag for Redis-plus-plus and for OpenSSL to be available, but I don't believe that is a prohibitive hurdle, as this is relatively standard practice for secure communication.

    Alternatives

    Only offer ACL auth, forgoing the benefits outlined above for cloud-hosted embedding storage.

    Additional context

    Adding unix socket support here as well might be a good idea since it's touching the same area of the codebase. This would improve performance in co-located settings when that is preferred. I can write up a separate ticket for this though.

    fea::functional requirement 
    opened by Spartee 5
  • [BUG] HPS tensorflow plugin, multi-gpu example crashes

    Describe the bug HPS example hps_pretrained_model_training_demo.ipynb crashes.

    To Reproduce Steps to reproduce the behavior:

    docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:22.09
    docker run --runtime=nvidia -it nvcr.io/nvidia/merlin/merlin-tensorflow:22.09 bash
    cd /hugectr/hierarchical_parameter_server/notebooks
    # run all code blocks of hps_pretrained_model_training_demo.ipynb with dnn.json in notebook
    

    Expected behavior The pre-trained model can be loaded with HPS & trained.

    Screenshots

    [email protected]:/hugectr/hierarchical_parameter_server/notebooks# python hps_pretrained_model_training_demo.py
    [INFO] hierarchical_parameter_server is imported
    2022-10-01 14:15:47.658583: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
    /usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (3.0.4) doesn't match a supported version!
      warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
    2022-10-01 14:15:49.377492: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2022-10-01 14:15:51.011734: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
    2022-10-01 14:15:51.011800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 77658 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:1f:00.0, compute capability: 8.0
    2022-10-01 14:15:51.013072: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
    2022-10-01 14:15:51.013100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77658 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:25:00.0, compute capability: 8.0
    2022-10-01 14:15:51.014485: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
    2022-10-01 14:15:51.014514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 77658 MB memory:  -> device: 2, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:50:00.0, compute capability: 8.0
    2022-10-01 14:15:51.015644: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
    2022-10-01 14:15:51.015668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 77658 MB memory:  -> device: 3, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:55:00.0, compute capability: 8.0
    WARNING:tensorflow:The following Variables were used in a Lambda layer's call (tf.compat.v1.nn.embedding_lookup_sparse), but are not present in its tracked objects:   <tf.Variable 'Variable:0' shape=(100000, 16) dtype=float32>. This is a strong indication that the Lambda layer should be rewritten as a subclassed Layer.
    Model: "model"
    __________________________________________________________________________________________________
     Layer (type)                   Output Shape         Param #     Connected to                     
    ==================================================================================================
     input_1 (InputLayer)           [(None, 5)]          0           []                               
                                                                                                      
     tf.compat.v1.nn.embedding_look  (None, 16)          0           ['input_1[0][0]']                
     up_sparse (TFOpLambda)                                                                           
                                                                                                      
     tf.reshape (TFOpLambda)        (None, 160)          0           ['tf.compat.v1.nn.embedding_looku
                                                                     p_sparse[0][0]']                 
                                                                                                      
     input_2 (InputLayer)           [(None, 10)]         0           []                               
                                                                                                      
     tf.concat (TFOpLambda)         (None, 170)          0           ['tf.reshape[0][0]',             
                                                                      'input_2[0][0]']                
                                                                                                      
     fc1 (Dense)                    (None, 1024)         175104      ['tf.concat[0][0]']              
                                                                                                      
     fc2 (Dense)                    (None, 256)          262400      ['fc1[0][0]']                    
                                                                                                      
     fc3 (Dense)                    (None, 1)            257         ['fc2[0][0]']                    
                                                                                                      
    ==================================================================================================
    Total params: 437,761
    Trainable params: 437,761
    Non-trainable params: 0
    __________________________________________________________________________________________________
    WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
    WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
    2022-10-01 14:15:54.925159: I tensorflow/stream_executor/cuda/cuda_blas.cc:1804] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
    /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1082: UserWarning: "`binary_crossentropy` received `from_logits=True`, but the `output` argument was produced by a sigmoid or softmax activation and thus does not represent logits. Was this intended?"
      return dispatch_target(*args, **kwargs)
    WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
    -------------------- Step 0, loss: PerReplica:{
      0: tf.Tensor(0.17562991, shape=(), dtype=float32),
      1: tf.Tensor(0.17909361, shape=(), dtype=float32),
      2: tf.Tensor(0.17878108, shape=(), dtype=float32),
      3: tf.Tensor(0.17324439, shape=(), dtype=float32)
    } --------------------
    WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
    WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
    WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
    -------------------- Step 1, loss: PerReplica:{
      0: tf.Tensor(653.8149, shape=(), dtype=float32),
      1: tf.Tensor(693.7608, shape=(), dtype=float32),
      2: tf.Tensor(613.2731, shape=(), dtype=float32),
      3: tf.Tensor(628.3385, shape=(), dtype=float32)
    } --------------------
    WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
    WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
    -------------------- Step 2, loss: PerReplica:{
      0: tf.Tensor(37.584198, shape=(), dtype=float32),
      1: tf.Tensor(36.131, shape=(), dtype=float32),
      2: tf.Tensor(38.500664, shape=(), dtype=float32),
      3: tf.Tensor(37.32876, shape=(), dtype=float32)
    } --------------------
    -------------------- Step 3, loss: PerReplica:{
      0: tf.Tensor(5.023567, shape=(), dtype=float32),
      1: tf.Tensor(3.7619786, shape=(), dtype=float32),
      2: tf.Tensor(4.988394, shape=(), dtype=float32),
      3: tf.Tensor(4.648823, shape=(), dtype=float32)
    } --------------------
    WARNING:tensorflow:5 out of the last 5 calls to <function _apply_all_reduce.<locals>._all_reduce at 0x7fc34c647940> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
    -------------------- Step 4, loss: PerReplica:{
      0: tf.Tensor(1.080203, shape=(), dtype=float32),
      1: tf.Tensor(1.2417698, shape=(), dtype=float32),
      2: tf.Tensor(1.2622243, shape=(), dtype=float32),
      3: tf.Tensor(1.1184206, shape=(), dtype=float32)
    } --------------------
    WARNING:tensorflow:6 out of the last 6 calls to <function _apply_all_reduce.<locals>._all_reduce at 0x7fc34c647ee0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
    -------------------- Step 5, loss: PerReplica:{
      0: tf.Tensor(0.654034, shape=(), dtype=float32),
      1: tf.Tensor(0.7189002, shape=(), dtype=float32),
      2: tf.Tensor(0.66333723, shape=(), dtype=float32),
      3: tf.Tensor(0.6037976, shape=(), dtype=float32)
    } --------------------
    -------------------- Step 6, loss: PerReplica:{
      0: tf.Tensor(0.79754734, shape=(), dtype=float32),
      1: tf.Tensor(0.9231312, shape=(), dtype=float32),
      2: tf.Tensor(0.90430397, shape=(), dtype=float32),
      3: tf.Tensor(0.91203874, shape=(), dtype=float32)
    } --------------------
    -------------------- Step 7, loss: PerReplica:{
      0: tf.Tensor(0.22423872, shape=(), dtype=float32),
      1: tf.Tensor(0.211602, shape=(), dtype=float32),
      2: tf.Tensor(0.2190841, shape=(), dtype=float32),
      3: tf.Tensor(0.19895837, shape=(), dtype=float32)
    } --------------------
    -------------------- Step 8, loss: PerReplica:{
      0: tf.Tensor(1.7644451, shape=(), dtype=float32),
      1: tf.Tensor(1.7413795, shape=(), dtype=float32),
      2: tf.Tensor(1.6232728, shape=(), dtype=float32),
      3: tf.Tensor(1.5175638, shape=(), dtype=float32)
    } --------------------
    -------------------- Step 9, loss: PerReplica:{
      0: tf.Tensor(0.35069197, shape=(), dtype=float32),
      1: tf.Tensor(0.32513526, shape=(), dtype=float32),
      2: tf.Tensor(0.30032104, shape=(), dtype=float32),
      3: tf.Tensor(0.3842827, shape=(), dtype=float32)
    } --------------------
    You are using the plugin with MirroredStrategy.
    =====================================================HPS Parse====================================================
    [HCTR][14:16:00.618][INFO][RK0][main]: dense_file is not specified using default: 
    [HCTR][14:16:00.618][INFO][RK0][main]: num_of_refresher_buffer_in_pool is not specified using default: 1
    [HCTR][14:16:00.618][INFO][RK0][main]: maxnum_des_feature_per_sample is not specified using default: 26
    [HCTR][14:16:00.618][INFO][RK0][main]: refresh_delay is not specified using default: 0
    [HCTR][14:16:00.618][INFO][RK0][main]: refresh_interval is not specified using default: 0
    ====================================================HPS Create====================================================
    [HCTR][14:16:00.618][INFO][RK0][main]: Creating HashMap CPU database backend...
    [HCTR][14:16:00.618][DEBUG][RK0][main]: Created blank database backend in local memory!
    [HCTR][14:16:00.619][INFO][RK0][main]: Volatile DB: initial cache rate = 1
    [HCTR][14:16:00.619][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
    [HCTR][14:16:00.619][DEBUG][RK0][main]: Created raw model loader in local memory!
    [HCTR][14:16:00.760][INFO][RK0][main]: Table: hps_et.dnn.sparse_embedding0; cached 100000 / 100000 embeddings in volatile database (HashMapBackend); load: 100000 / 18446744073709551615 (0.00%).
    [HCTR][14:16:00.760][DEBUG][RK0][main]: Real-time subscribers created!
    [HCTR][14:16:00.760][INFO][RK0][main]: Creating embedding cache in device 0.
    [HCTR][14:16:00.768][INFO][RK0][main]: Model name: dnn
    [HCTR][14:16:00.768][INFO][RK0][main]: Number of embedding tables: 1
    [HCTR][14:16:00.768][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 1.000000
    [HCTR][14:16:00.768][INFO][RK0][main]: Use I64 input key: True
    [HCTR][14:16:00.768][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
    [HCTR][14:16:00.768][INFO][RK0][main]: The size of thread pool: 112
    [HCTR][14:16:00.768][INFO][RK0][main]: The size of worker memory pool: 3
    [HCTR][14:16:00.768][INFO][RK0][main]: The size of refresh memory pool: 1
    [HCTR][14:16:00.768][INFO][RK0][main]: The refresh percentage : 0.200000
    [HCTR][14:16:00.780][INFO][RK0][main]: Creating embedding cache in device 1.
    [HCTR][14:16:00.787][INFO][RK0][main]: Model name: dnn
    [HCTR][14:16:00.787][INFO][RK0][main]: Number of embedding tables: 1
    [HCTR][14:16:00.787][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 1.000000
    [HCTR][14:16:00.787][INFO][RK0][main]: Use I64 input key: True
    [HCTR][14:16:00.787][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
    [HCTR][14:16:00.787][INFO][RK0][main]: The size of thread pool: 112
    [HCTR][14:16:00.787][INFO][RK0][main]: The size of worker memory pool: 3
    [HCTR][14:16:00.787][INFO][RK0][main]: The size of refresh memory pool: 1
    [HCTR][14:16:00.787][INFO][RK0][main]: The refresh percentage : 0.200000
    [HCTR][14:16:00.790][INFO][RK0][main]: Creating embedding cache in device 2.
    [HCTR][14:16:00.796][INFO][RK0][main]: Model name: dnn
    [HCTR][14:16:00.796][INFO][RK0][main]: Number of embedding tables: 1
    [HCTR][14:16:00.796][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 1.000000
    [HCTR][14:16:00.796][INFO][RK0][main]: Use I64 input key: True
    [HCTR][14:16:00.796][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
    [HCTR][14:16:00.796][INFO][RK0][main]: The size of thread pool: 112
    [HCTR][14:16:00.796][INFO][RK0][main]: The size of worker memory pool: 3
    [HCTR][14:16:00.796][INFO][RK0][main]: The size of refresh memory pool: 1
    [HCTR][14:16:00.796][INFO][RK0][main]: The refresh percentage : 0.200000
    [HCTR][14:16:00.799][INFO][RK0][main]: Creating embedding cache in device 3.
    [HCTR][14:16:00.805][INFO][RK0][main]: Model name: dnn
    [HCTR][14:16:00.805][INFO][RK0][main]: Number of embedding tables: 1
    [HCTR][14:16:00.805][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 1.000000
    [HCTR][14:16:00.805][INFO][RK0][main]: Use I64 input key: True
    [HCTR][14:16:00.805][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
    [HCTR][14:16:00.805][INFO][RK0][main]: The size of thread pool: 112
    [HCTR][14:16:00.805][INFO][RK0][main]: The size of worker memory pool: 3
    [HCTR][14:16:00.805][INFO][RK0][main]: The size of refresh memory pool: 1
    [HCTR][14:16:00.805][INFO][RK0][main]: The refresh percentage : 0.200000
    [HCTR][14:16:00.866][DEBUG][RK0][main]: Created raw model loader in local memory!
    [HCTR][14:16:00.869][INFO][RK0][main]: EC initialization for model: "dnn", num_tables: 1
    [HCTR][14:16:00.870][INFO][RK0][main]: EC initialization on device: 0
    [HCTR][14:16:00.871][INFO][RK0][main]: EC initialization on device: 1
    [HCTR][14:16:00.872][INFO][RK0][main]: EC initialization on device: 2
    [HCTR][14:16:00.873][INFO][RK0][main]: EC initialization on device: 3
    [HCTR][14:16:00.874][INFO][RK0][main]: Creating lookup session for dnn on device: 0
    [HCTR][14:16:00.874][INFO][RK0][main]: Creating lookup session for dnn on device: 1
    [HCTR][14:16:00.874][INFO][RK0][main]: Creating lookup session for dnn on device: 2
    [HCTR][14:16:00.874][INFO][RK0][main]: Creating lookup session for dnn on device: 3
    Model: "model_1"
    __________________________________________________________________________________________________
     Layer (type)                   Output Shape         Param #     Connected to                     
    ==================================================================================================
     input_3 (InputLayer)           [(None, 5)]          0           []                               
                                                                                                      
     sparse_lookup_layer (SparseLoo  (None, 16)          0           ['input_3[0][0]']                
     kupLayer)                                                                                        
                                                                                                      
     tf.reshape_1 (TFOpLambda)      (None, 160)          0           ['sparse_lookup_layer[0][0]']    
                                                                                                      
     input_4 (InputLayer)           [(None, 10)]         0           []                               
                                                                                                      
     tf.concat_1 (TFOpLambda)       (None, 170)          0           ['tf.reshape_1[0][0]',           
                                                                      'input_4[0][0]']                
                                                                                                      
     new_fc (Dense)                 (None, 1)            171         ['tf.concat_1[0][0]']            
                                                                                                      
    ==================================================================================================
    Total params: 171
    Trainable params: 171
    Non-trainable params: 0
    __________________________________________________________________________________________________
    Traceback (most recent call last):
      File "hps_pretrained_model_training_demo.py", line 307, in <module>
        model = train_with_pretrained_embeddings(args)
      File "hps_pretrained_model_training_demo.py", line 301, in train_with_pretrained_embeddings
        _, loss = strategy.run(_train_step, args=(inputs, labels))
      File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1312, in run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
      File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2888, in call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
      File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 676, in _call_for_each_replica
        return mirrored_run.call_for_each_replica(
      File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 101, in call_for_each_replica
        return _call_for_each_replica(strategy, fn, args, kwargs)
      File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 283, in _call_for_each_replica
        coord.join(threads)
      File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/coordinator.py", line 385, in join
        six.reraise(*self._exc_info_to_raise)
      File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
        raise value
      File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/coordinator.py", line 293, in stop_on_exception
        yield
      File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 386, in run
        self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
      File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 595, in wrapper
        return func(*args, **kwargs)
      File "hps_pretrained_model_training_demo.py", line 284, in _train_step
        logit, _ = model(inputs)
      File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None
      File "hps_pretrained_model_training_demo.py", line 252, in call
        embeddings = tf.reshape(self.sparse_lookup_layer(sp_ids=input_cat, sp_weights = None, combiner=self.combiner),
      File "/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/core/sparse_lookup_layer.py", line 200, in call
        embeddings = lookup_ops.lookup(
      File "/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/core/lookup_ops.py", line 98, in lookup
        status = Init(ps_config_file=ps_config_file, global_batch_size=global_batch_size)
      File "/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/core/initialize.py", line 225, in Init
        _init_results = _init_wrapper(_run_fn, _init_fn, **kwargs)
      File "/tmp/__autograph_generated_file6j_afal8.py", line 12, in tf___init_wrapper
        retval_ = ag__.converted_call(ag__.ld(run_fn), (ag__.ld(init_fn),), dict(kwargs=ag__.ld(kwargs)), fscope)
    RuntimeError: Exception encountered when calling layer "sparse_lookup_layer" (type SparseLookupLayer).
    
    in user code:
    
        File "/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/core/initialize.py", line 211, in _init_wrapper  *
            return run_fn(init_fn, kwargs=kwargs)
    
        RuntimeError: Method requires being in cross-replica context, use get_replica_context().merge_call()
    
    
    Call arguments received by layer "sparse_lookup_layer" (type SparseLookupLayer):
      • sp_ids=<tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7fc34c6824c0>
      • sp_weights=None
      • name=None
      • combiner='mean'
      • max_norm=None
    

    Environment (please complete the following information):

    • OS: CentOS-7
    • DGX-A100
    • CUDA version: CUDA 11.7 on host os.
    • Docker image: nvcr.io/nvidia/merlin/merlin-tensorflow:22.09

    Additional context: This is the only multi-GPU example for the HPS TensorFlow plugin. Is there any detailed guidance on deploying the HPS TensorFlow plugin in a multi-GPU environment?

    opened by Jeffery-Song 2
  • [BUG] Segmentation fault using cudf after HugeCTR model run

    Describe the bug

    After instantiating a HugeCTR model and attempting to use cudf after, encounter a segmentation fault.

    We encountered this with two separate pytest test functions. The first creates a HugeCTR model and runs without error. The second does something unrelated that uses cudf, and it passes in isolation. However, if the HugeCTR test runs first, the second test fails with a segmentation fault.

    Presumably there is some global state that needs to be cleaned up that is not automatically happening as a result of the model reference going out-of-scope.

    Steps To Reproduce

    1. Instantiate a HugeCTR model and fit it on a dataset.
    2. Delete the model object (pytest cleanup).
    3. In the same Python session, call cudf.DataFrame.from_pandas(x).
    4. A segmentation fault occurs in a from_arrow function in cudf.

    import hugectr
    import cudf
    
    def test_hugectr_model():
        model = hugectr.Model(...)
        ...
        model.compile(...)
        model.fit(...)
    
    
    def test_something_else():
        ...  # build a pandas DataFrame `x` here (details elided in the report)
        cudf.DataFrame.from_pandas(x)
    
    

    Expected behavior

    No segmentation fault calling cudf in a context where the HugeCTR model should be cleaned up.

    Additional context

    I don't currently have an environment where I can reproduce this locally (#337), so the reproduction steps may not be minimal.

    Examples of this segmentation fault can be found in the nvidia-merlin-bot comments on this PR https://github.com/NVIDIA-Merlin/systems/pull/129

    bug P1 
    opened by oliverholworthy 3
  • [Requirement]Profiling operations for HugeCTR

    Hi HugeCTR team,

    Recently I have used Nsight to profile a model which uses HugeCTR.

    Unlike DLProf, which gives a per-operation breakdown for the model, Nsight produces very low-level results, and it is difficult to work out the total time spent in each operation.

    Is there a way to get a high-level, per-operation profile for a HugeCTR model?

    fea::functional P1 requirement 
    opened by regnnighe 3
Releases(v4.2)
  • v4.2(Nov 15, 2022)

    What's New in Version 4.2

    In January 2023, the HugeCTR team plans to deprecate semantic versioning, such as `v4.2`.
    Afterward, the library will use calendar versioning only, such as `v23.01`.
    
    • Change to HPS with Redis or Kafka: This release includes a change to Hierarchical Parameter Server and affects deployments that use RedisClusterBackend or model parameter streaming with Kafka. The third-party library that was used for the HPS partition selection algorithm is replaced to improve performance. The new algorithm can produce different partition assignments for volatile databases. As a result, volatile database backends that retain data between application startups, such as the RedisClusterBackend, must be reinitialized. Model streaming with Kafka is equally affected. To avoid issues with updates, reset all respective queue offsets to the end_offset before you reinitialize the RedisClusterBackend.
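
As a rough illustration of why retained data must be reinitialized, the sketch below assigns the same keys to partitions with two different hash functions. The hash functions (SHA-256 and BLAKE2b), the partition count, and the names partition_old and partition_new are assumptions chosen for illustration; they are not the algorithms that HPS uses.

```python
import hashlib

NUM_PARTITIONS = 8  # hypothetical partition count


def partition_old(key: int) -> int:
    """Stand-in for a previous partition selection function (assumption)."""
    digest = hashlib.sha256(key.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % NUM_PARTITIONS


def partition_new(key: int) -> int:
    """Stand-in for a replacement partition selection function (assumption)."""
    digest = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % NUM_PARTITIONS


keys = range(1000)
# Keys whose partition differs between the two functions would be looked up
# in a partition that does not hold them if the old data were kept.
moved = [k for k in keys if partition_old(k) != partition_new(k)]
print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

Any key in `moved` would be searched for in the wrong partition if data written under the old assignment were reused, which is the situation the reinitialization step avoids.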

    • Enhancements to the Sparse Operation Kit in DeepRec: This release includes updates to the Sparse Operation Kit to improve the performance of the embedding variable lookup operation in DeepRec. The API for the lookup_sparse() function is changed to remove the hotness argument. The lookup_sparse() function is enhanced to calculate the number of non-zero elements dynamically. For more information, refer to the sparse_operation_kit directory of the DeepRec repository in GitHub.

    • Enhancements to 3G Embedding: This release includes the following enhancements to 3G embedding:

      • The API is changed. The EmbeddingPlanner class is replaced with the EmbeddingCollectionConfig class. For examples of the API, see the tests in the test/embedding_collection_test directory of the repository in GitHub.
      • The API is enhanced to support dumping and loading weights during the training process. The methods are Model.embedding_dump(path: str, table_names: list[str]) and Model.embedding_load(path: str, table_names: list[str]). The path argument is a directory in the file system that you can dump weights to or load weights from. The table_names argument is a list of embedding table names as strings.
    • New Volatile Database Type for HPS: This release adds a db_type value of multi_process_hash_map to the Hierarchical Parameter Server. This database type supports sharing embeddings across process boundaries by using shared memory and the /dev/shm device file. Multiple processes running HPS can read and write to the same hash map. For an example, refer to the Hierarchical Parameter Server Demo notebook.
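
The general mechanism, named shared memory that more than one process can attach to, can be sketched with the Python standard library. This example does not use HPS or the multi_process_hash_map backend; the vector size and values are made up, and the second handle is opened in the same process only to keep the sketch short. On Linux, these blocks are typically backed by files under /dev/shm.

```python
import struct
from multiprocessing import shared_memory

EMBEDDING_DIM = 4                 # hypothetical embedding vector size
VECTOR_BYTES = EMBEDDING_DIM * 4  # four float32 values
FMT = f"{EMBEDDING_DIM}f"

# "Writer" side: create a named shared-memory block and store one vector.
writer = shared_memory.SharedMemory(create=True, size=VECTOR_BYTES)
struct.pack_into(FMT, writer.buf, 0, 0.5, -1.0, 2.0, 0.25)

# "Reader" side: any process that knows the block name can attach to it.
# A second handle in this process stands in for a second process here.
reader = shared_memory.SharedMemory(name=writer.name)
vector = struct.unpack_from(FMT, reader.buf, 0)
print(vector)

# Release both handles, then remove the block itself.
reader.close()
writer.close()
writer.unlink()
```

In a real multi-process setup, the block name would be passed to or agreed on by the other processes, and concurrent writers would need their own synchronization.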

    • Enhancements to the HPS Redis Backend: In this release, the Hierarchical Parameter Server can open multiple connections in parallel to each Redis node. This enhancement enables HPS to take advantage of overlapped processing optimizations in the I/O module of Redis servers. In addition, HPS can now take advantage of Redis hash tags to co-locate embedding values and metadata. This enhancement can reduce the number of accesses to Redis nodes and the number of per-node round trip communications that are needed to complete transactions. As a result, the enhancement increases the insertion performance.
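
Redis Cluster maps each key to one of 16384 hash slots using CRC16, and when a key contains a non-empty {...} section, only that section is hashed. The sketch below reimplements that rule in plain Python to show how a shared hash tag places two keys in the same slot. The key names are hypothetical and do not reflect the key layout that HPS actually uses.

```python
NUM_SLOTS = 16384  # number of hash slots in a Redis Cluster


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (the Redis Cluster variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """Return the hash slot for a key, honoring a non-empty {hash tag}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]  # hash only the tag contents
    return crc16_xmodem(key) % NUM_SLOTS


# Hypothetical key names: both share the tag "emb_table_0/p7".
values_key = b"{emb_table_0/p7}:values"
metadata_key = b"{emb_table_0/p7}:metadata"
print(key_slot(values_key) == key_slot(metadata_key))  # True: same tag, same slot
```

Because both keys land in the same slot, they live on the same node, so a client can read or update them together without contacting a second node.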

    • MLPLayer is New: This release adds an MLP layer with the hugectr.Layer_t.MLP class. The layer makes it easier to use a group of fused fully-connected layers and to enable the related optimizations. For each fused fully-connected layer in MLPLayer, the output dimension, bias, and activation function are all adjustable. MLPLayer supports the FP32, FP16, and TF32 data types. To learn how to use the layer, refer to dgx_a100_mlp.py in the samples/dlrm directory of the GitHub repository.

    • Sparse Operation Kit installable from PyPI: Version 1.1.4 of the Sparse Operation Kit is installable from PyPI in the merlin-sok package.

    • Multi-task Model Support added to the ONNX Model Converter: This release adds support for multi-task models to the ONNX converter. This release also includes an enhancement to the preprocess_census.py script in the samples/mmoe directory of the GitHub repository.

    • Issues Fixed:

      • Using the HPS Plugin for TensorFlow with MirroredStrategy and running the Hierarchical Parameter Server Demo notebook triggered an issue with ReplicaContext and caused a crash. The issue is fixed and resolves GitHub issue #362.
      • The 4_nvt_process.py sample in the samples/din/utils directory of the GitHub repository is updated to use the latest NVTabular API. This update resolves GitHub issue #364.
      • An illegal memory access related to 3G embedding and the dgx_a100_ib_nvlink.py sample in the samples/dlrm directory of the GitHub repository is fixed.
      • An error in HPS with the lookup_fromdlpack() method is fixed. The error was related to calculating the number of keys and vectors from the corresponding DLPack tensors.
      • An error in the HugeCTR backend for Triton Inference Server is fixed. A crash was triggered when the initial size of the embedding cache was smaller than the allowed minimum size.
      • An error related to using a ReLU layer with an odd input size in mixed precision mode could trigger a crash. The issue is fixed.
      • An error related to using an asynchronous reader with the AsyncParam class and specifying an io_alignment value that is smaller than the block device sector size is fixed. Now, if the specified io_alignment value is smaller than the block device sector size, io_alignment is automatically set to the block device sector size.
      • Unreported memory leaks in the GRU layer and collectives are fixed.
      • Several broken documentation links related to HPS are fixed.
    • Known Issues:

      • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

          --shm-size=1g --ulimit memlock=-1
        

        See also the NCCL known issue and the GitHub issue.

      • KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.

      • The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

      • Joint loss training with a regularizer is not supported.

      • Dumping Adam optimizer states to AWS S3 is not supported.

  • v4.1.1(Nov 2, 2022)

    What's New in Version 4.1.1

    • Simplified Interface for 3G Embedding Table Placement Strategy: 3G embedding now provides an easier way for you to configure an embedding table placement strategy. Instead of using JSON, you can configure the embedding table placement strategy by using function arguments. You only need to provide the shard_matrix, table_group_strategy, and table_placement_strategy arguments. With these arguments, 3G embedding can group different tables together and place them according to the shard_matrix argument. For an example, refer to the dlrm_train.py file in the test/embedding_collection_test directory of the repository on GitHub. For comparison, refer to the same file from the v4.0 branch of the repository.

    • New MMoE and Shared-Bottom Samples: This release includes a new shared-bottom model, an example program, preprocessing scripts, and updates to documentation. For more information, refer to the README.md, mmoe_parquet.py, and other files in the samples/mmoe directory of the repository on GitHub. This release also includes a fix to the calculation and reporting of AUC for multi-task models, such as MMoE.

    • Support for AWS S3 File System: The Parquet DataReader can now read datasets from the Amazon Web Services S3 file system. You can also load and dump models from and to S3 during training. The documentation for the DataSourceParams class is updated. To view sample code, refer to the HugeCTR Training with Remote File System Example.

    • Simplification of File System Usage: You no longer need to pass DataSourceParams for model loading and dumping. The FileSystem class automatically infers the correct file system type (local, HDFS, or S3) based on the path URI that you specified when you built the model. For example, the path hdfs://localhost:9000/ is inferred as an HDFS file system and the path https://mybucket.s3.us-east-1.amazonaws.com/ is inferred as an S3 file system.
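
The kind of inference described above can be sketched as follows. The function name infer_filesystem_type and its matching rules are assumptions made for illustration, based only on the two example URIs in the note; the actual rules inside the FileSystem class may differ.

```python
from urllib.parse import urlparse


def infer_filesystem_type(path: str) -> str:
    """Guess a file system type from a path or URI (illustrative rules only)."""
    parsed = urlparse(path)
    host = parsed.hostname or ""
    if parsed.scheme == "hdfs":
        return "HDFS"
    # Assumption: an HTTPS URL on an Amazon S3 endpoint means S3.
    if parsed.scheme == "https" and ".s3." in host and host.endswith(".amazonaws.com"):
        return "S3"
    # Anything else, including plain paths with no scheme, is treated as local.
    return "Local"


for uri in (
    "hdfs://localhost:9000/",
    "https://mybucket.s3.us-east-1.amazonaws.com/",
    "/data/models/dcn/",  # hypothetical local path
):
    print(uri, "->", infer_filesystem_type(uri))
```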

    • Support for Loading Models from Remote File Systems to HPS: This release enables you to load models from HDFS and S3 remote file systems to HPS during inference. To use the new feature, specify an HDFS or S3 path URI in InferenceParams.

    • Support for Exporting Intermediate Tensor Values into a NumPy Array: This release adds the check_out_tensor function to Model and InferenceModel. You can use this function to check out intermediate tensor values through the Python interface. This function is especially helpful for debugging. For more information, refer to Model.check_out_tensor and InferenceModel.check_out_tensor.

    • On-Device Input Keys for HPS Lookup: The HPS lookup supports input embedding keys that are in GPU memory during inference. This enhancement removes a host-to-device copy by using the DLPack lookup_fromdlpack() interface. With this interface, the input DLPack capsule of embedding keys can be a GPU tensor.

    • Documentation Enhancements:

    • Issues Fixed:

      • The InteractionLayer class is fixed so that it works correctly with num_feas > 30.
      • The cuBLASLt configuration is corrected by increasing the workspace size and adding the epilogue mask.
      • The NVTabular-based preprocessing script for our samples that demonstrate feature crossing is fixed.
      • The async data reader is fixed. Previously, it would hang and cause a corruption issue due to an improper I/O block size and I/O alignment problem. The AsyncParam class is changed to implement the fix. The io_block_size argument is replaced by the max_nr_request argument and the actual I/O block size that the async reader uses is computed accordingly. For more information, refer to the AsyncParam class documentation.
    • Known Issues:

      • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

          --shm-size=1g --ulimit memlock=-1
        

        See also the NCCL known issue and the GitHub issue.

      • KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.

      • The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

      • Joint loss training with a regularizer is not supported.

      • Dumping Adam optimizer states to AWS S3 is not supported.

  • v4.1(Oct 17, 2022)

    What's New in Version 4.1

    • Simplified Interface for 3G Embedding Table Placement Strategy: 3G embedding now provides an easier way for you to configure an embedding table placement strategy. Instead of using JSON, you can configure the embedding table placement strategy by using function arguments. You only need to provide the shard_matrix, table_group_strategy, and table_placement_strategy arguments. With these arguments, 3G embedding can group different tables together and place them according to the shard_matrix argument. For an example, refer to the dlrm_train.py file in the test/embedding_collection_test directory of the repository on GitHub. For comparison, refer to the same file from the v4.0 branch of the repository.

    • New MMoE and Shared-Bottom Samples: This release includes a new shared-bottom model, an example program, preprocessing scripts, and updates to documentation. For more information, refer to the README.md, mmoe_parquet.py, and other files in the samples/mmoe directory of the repository on GitHub. This release also includes a fix to the calculation and reporting of AUC for multi-task models, such as MMoE.

    • Support for AWS S3 File System: The Parquet DataReader can now read datasets from the Amazon Web Services S3 file system. You can also load and dump models from and to S3 during training. The documentation for the DataSourceParams class is updated. To view sample code, refer to the HugeCTR Training with Remote File System Example.

    • Simplification of File System Usage: You no longer need to pass DataSourceParams for model loading and dumping. The FileSystem class automatically infers the correct file system type (local, HDFS, or S3) based on the path URI that you specified when you built the model. For example, the path hdfs://localhost:9000/ is inferred as an HDFS file system and the path https://mybucket.s3.us-east-1.amazonaws.com/ is inferred as an S3 file system.

    • Support for Loading Models from Remote File Systems to HPS: This release enables you to load models from HDFS and S3 remote file systems to HPS during inference. To use the new feature, specify an HDFS or S3 path URI in InferenceParams.

    • Support for Exporting Intermediate Tensor Values into a NumPy Array: This release adds the check_out_tensor function to Model and InferenceModel. You can use this function to check out intermediate tensor values through the Python interface. This function is especially helpful for debugging. For more information, refer to Model.check_out_tensor and InferenceModel.check_out_tensor.

    • On-Device Input Keys for HPS Lookup: The HPS lookup supports input embedding keys that are in GPU memory during inference. This enhancement removes a host-to-device copy by using the DLPack lookup_fromdlpack() interface. With this interface, the input DLPack capsule of embedding keys can be a GPU tensor.

    • Documentation Enhancements:

    • Issues Fixed:

      • The InteractionLayer class is fixed so that it works correctly with num_feas > 30.
      • The cuBLASLt configuration is corrected by increasing the workspace size and adding the epilogue mask.
      • The NVTabular-based preprocessing script for our samples that demonstrate feature crossing is fixed.
      • The async data reader is fixed. Previously, it would hang and cause a corruption issue due to an improper I/O block size and I/O alignment problem. The AsyncParam class is changed to implement the fix. The io_block_size argument is replaced by the max_nr_request argument and the actual I/O block size that the async reader uses is computed accordingly. For more information, refer to the AsyncParam class documentation.
    • Known Issues:

      • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

          --shm-size=1g --ulimit memlock=-1
        

        See also the NCCL known issue and the GitHub issue.

      • KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.

      • The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

      • Joint loss training with a regularizer is not supported.

      • Dumping Adam optimizer states to AWS S3 is not supported.

  • v4.0(Sep 14, 2022)

    What's New in Version 4.0

    • 3G Embedding Stabilization: Since the introduction of the next generation of HugeCTR embedding in v3.7, several updates and enhancements were made, including code refactoring to improve usability. The enhancements for this release are as follows:

      • Optimized the performance for sparse lookup in terms of inter-warp load imbalance. Sparse Operation Kit (SOK) takes advantage of the enhancement to improve performance.
      • This release includes a fix for determining the maximum embedding vector size in the GlobalEmbeddingData and LocalEmbeddingData classes.
      • Version 1.1.4 of Sparse Operation Kit can be installed with Pip and includes the enhancements mentioned in the preceding bullets.
    • Embedding Cache Initialization with Configurable Ratio: In previous releases, the default value for the cache_refresh_percentage_per_iteration parameter of the InferenceParams was 0.1.

      In this release, the default value is 0.0 and the parameter serves an additional purpose. If you set the parameter to a value greater than 0.0 and also set use_gpu_embedding_cache to True for a model, when Hierarchical Parameter Server (HPS) starts, HPS initializes the embedding cache for the model on the GPU by loading a subset of the embedding vectors from the sparse files for the model. When embedding cache initialization is used, HPS writes log records at the INFO level when it starts. The log records are similar to EC initialization for model: "<model-name>", num_tables: <int> and EC initialization on device: <int>. This enhancement reduces the duration of the warm-up phase.
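To confirm that embedding cache initialization ran, you can search the HPS log for the two record shapes quoted above. The following is a minimal sketch: the function name, the sample lines, and the model name `dcn` are hypothetical, and the surrounding log prefix is ignored because only the message text is given in these notes.

```python
import re

# Patterns follow the message text quoted in the release note.
MODEL_RE = re.compile(
    r'EC initialization for model: "(?P<model>[^"]+)", num_tables: (?P<tables>\d+)'
)
DEVICE_RE = re.compile(r"EC initialization on device: (?P<device>\d+)")


def find_ec_initialization(log_lines):
    """Return ({model: num_tables}, [device ids]) found in the log lines."""
    models, devices = {}, []
    for line in log_lines:
        model_match = MODEL_RE.search(line)
        if model_match:
            models[model_match.group("model")] = int(model_match.group("tables"))
            continue
        device_match = DEVICE_RE.search(line)
        if device_match:
            devices.append(int(device_match.group("device")))
    return models, devices


# Hypothetical sample lines; a real log carries additional prefix fields.
sample_log = [
    'EC initialization for model: "dcn", num_tables: 2',
    "EC initialization on device: 0",
    "some unrelated message",
]
models, devices = find_ec_initialization(sample_log)
print(models, devices)
```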

    • Lazy Initialization of HPS Plugin for TensorFlow: In this release, when you deploy a SavedModel of TensorFlow with Triton Inference Server, HPS is implicitly initialized when the loaded model is executed for the first time. In previous releases, you needed to run hps.Init(ps_config_file, global_batch_size) explicitly. For more information, see the API documentation for hierarchical_parameter_server.Init.

    • Enhancements to the HDFS Backend:

      • The HDFS Backend is now called IO::HadoopFileSystem.
      • This release includes fixes for memory leaks.
      • This release includes refactoring to generalize the interface for HDFS and S3 as remote filesystems.
      • For more information, see hadoop_filesystem.hpp in the include/io directory of the repository on GitHub.
    • Dependency Clarification for Protobuf and Hadoop: Hadoop and Protobuf are true third_party modules now. Developers can now avoid unnecessary and frequent cloning and deletion.

    • Finer granularity control for overlap behavior: We deprecated the old overlapped_pipeline knob and introduced four new knobs, train_intra_iteration_overlap, train_inter_iteration_overlap, eval_intra_iteration_overlap, and eval_inter_iteration_overlap, to help users better control the overlap behavior. For more information, see the API documentation for Solver.CreateSolver.

    • Documentation Improvements:

      • Removed two deprecated tutorials triton_tf_deploy and dump_to_tf.
      • Previously, the graphics in the Performance page did not appear. This issue is fixed in this release.
      • Previously, the API documentation for the HPS Plugin for TensorFlow did not show the class information. This issue is fixed in this release.
    • Issues Fixed:

      • Fixed a build error that was triggered in debug mode. The error was caused by the newly introduced 3G embedding unit tests.
      • When using the Parquet DataReader, if a parquet dataset file specified in metadata.json does not exist, HugeCTR no longer crashes. The new behavior is to skip the missing file and display a warning message. This change relates to GitHub issue 321.
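The new skip-and-warn behavior for missing Parquet files can be mirrored in a pre-flight check of your own. The following is a minimal sketch, not HugeCTR code: it assumes the metadata file is named `_metadata.json` and lists files under a `file_stats` array with a `file_name` field, and the helper name is hypothetical.

```python
import json
import os
import tempfile
import warnings


def existing_parquet_files(dataset_dir):
    """List files named in _metadata.json, warning about and skipping missing ones."""
    with open(os.path.join(dataset_dir, "_metadata.json")) as f:
        metadata = json.load(f)
    kept = []
    for entry in metadata.get("file_stats", []):  # assumed key name
        path = os.path.join(dataset_dir, entry["file_name"])  # assumed key name
        if os.path.isfile(path):
            kept.append(path)
        else:
            warnings.warn(f"Skipping missing Parquet file: {path}")
    return kept


# Build a throwaway dataset directory in which one listed file is missing.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "part_0.parquet"), "wb").close()
    meta = {"file_stats": [{"file_name": "part_0.parquet"},
                           {"file_name": "part_1.parquet"}]}
    with open(os.path.join(tmp, "_metadata.json"), "w") as f:
        json.dump(meta, f)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        found = [os.path.basename(p) for p in existing_parquet_files(tmp)]

print(found, len(caught))
```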
    • Known Issues:

      • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

          --shm-size=1g --ulimit memlock=-1
        

        See also the NCCL known issue and the GitHub issue.

      • KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.

      • The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

      • Joint loss training with a regularizer is not supported.

    Source code(tar.gz)
    Source code(zip)
  • v3.9.1(Sep 8, 2022)

  • v3.9(Aug 23, 2022)

    What's New in Version 3.9

    • Updates to 3G Embedding:

      • Sparse Operation Kit (SOK) is updated to use the HugeCTR 3G embedding as a developer preview feature. For more information, refer to the Python programs in the sparse_operation_kit/experiment/benchmark/dlrm directory of the repository on GitHub.
      • Dynamic embedding table mode is added. The mode is based on cuCollections with some functional enhancements. A dynamic embedding table grows its size when the table is full so that you no longer need to configure the memory usage information for embedding. For more information, refer to the embedding_storage/dynamic_embedding_storage directory of the repository on GitHub.
    • Enhancements to the HPS Plugin for TensorFlow: This release includes improvements to the interoperability of SOK and HPS. The plugin now supports the sparse lookup layer. The documentation for the HPS plugin is enhanced as follows:

    • Enhancements to the HPS Backend for Triton Inference Server: This release adds support for integrating the HPS Backend and the TensorFlow Backend through the ensemble mode with Triton Inference Server. The enhancement enables deploying a TensorFlow model with large embedding tables with Triton by leveraging HPS. For more information, refer to the sample programs in the hps-triton-ensemble directory of the HugeCTR Backend repository on GitHub.

    • New Multi-Node Tutorial: The multi-node training tutorial is new. It shows how to use HugeCTR to train a model with multiple nodes and is based on our most recent Docker container. The tutorial should be useful to users whose cluster does not have a job scheduler, such as Slurm Workload Manager, installed. The update addresses an issue that was first reported in GitHub issue 305.

    • Support Offline Inference for MMoE: This release includes MMoE offline inference where both per-class AUC and average AUC are provided. When the number of class AUCs is greater than one, the output includes a line like the following example:

      [HCTR][08:52:59.254][INFO][RK0][main]: Evaluation, AUC: {0.482141, 0.440781}, macro-averaging AUC: 0.46146124601364136
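As a quick sanity check on the example line above, the macro-averaged value is the unweighted mean of the per-class AUCs. The following sketch parses that exact line and recomputes the mean; the regular expression is an assumption about the message layout based only on this one example.

```python
import re

log_line = ("[HCTR][08:52:59.254][INFO][RK0][main]: Evaluation, AUC: "
            "{0.482141, 0.440781}, macro-averaging AUC: 0.46146124601364136")

# Capture the per-class list inside the braces and the reported average.
match = re.search(r"AUC: \{([^}]*)\}, macro-averaging AUC: ([0-9.]+)", log_line)
per_class_auc = [float(value) for value in match.group(1).split(",")]
reported_macro = float(match.group(2))

# Unweighted mean of the per-class values.
macro_auc = sum(per_class_auc) / len(per_class_auc)
print(per_class_auc, macro_auc, reported_macro)
```

The recomputed mean agrees with the reported value up to the rounding of the printed per-class numbers.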
      
    • Enhancements to the API for the HPS Database Backend: This release includes several enhancements to the API for the DatabaseBackend class. For more information, see database_backend.hpp and the header files for other database backends in the HugeCTR/include/hps directory of the repository. The enhancements are as follows:

      • You can now specify a maximum time budget, in nanoseconds, for queries so that you can build an application that must operate within strict latency limits. Fetch queries return execution control to the caller if the time budget is exhausted. The unprocessed entries are indicated to the caller through a callback function.
      • The dump and load_dump methods are new. These methods support saving and loading embedding tables from disk. The methods support a custom binary format and the RocksDB SST table file format. These methods enable you to import and export embedding table data between your custom tools and HugeCTR.
      • The find_tables method is new. The method enables you to discover all table data that is currently stored for a model in a DatabaseBackend instance. A new overloaded method for evict is added that can process the results from find_tables to quickly and simply drop all the stored information that is related to a model.
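The time-budget behavior described in the first bullet above can be illustrated with a small Python model. This is a sketch of the pattern only, not the C++ DatabaseBackend API: `fetch_with_budget`, its parameters, and the single `on_miss` callback (used here for both absent and unprocessed keys) are hypothetical, and a fake clock keeps the example deterministic.

```python
import time


def fetch_with_budget(table, keys, budget_ns, on_miss, clock=time.perf_counter_ns):
    """Look up keys until budget_ns elapses; report the rest through on_miss."""
    deadline = clock() + budget_ns
    hits = {}
    for index, key in enumerate(keys):
        if clock() >= deadline:
            # Budget exhausted: hand every unprocessed key back to the caller.
            for rest in keys[index:]:
                on_miss(rest)
            break
        if key in table:
            hits[key] = table[key]
        else:
            on_miss(key)
    return hits


table = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}
ticks = iter(range(0, 1000, 10))  # fake clock that advances 10 ns per call
unprocessed = []
hits = fetch_with_budget(table, [1, 2, 3], budget_ns=25,
                         on_miss=unprocessed.append, clock=lambda: next(ticks))
print(sorted(hits), unprocessed)
```

With a 25 ns budget and a clock that advances 10 ns per reading, the first two keys are served and the third is returned to the caller as unprocessed.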
    • Documentation Enhancements:

      • The documentation for the max_all_to_all_bandwidth parameter of the HybridEmbeddingParam class is clarified to indicate that the bandwidth unit is per-GPU. Previously, the unit was not specified.
    • Issues Fixed:

      • Hybrid embedding with IB_NVLINK as the communication_type of the HybridEmbeddingParam is fixed in this release.
      • Training performance is affected by a GPU routine that checks whether an input key falls outside the embedding table. If you can guarantee that the input keys fit within the specified workspace_size_per_gpu_in_mb, you can disable the routine by setting the environment variable HUGECTR_DISABLE_OVERFLOW_CHECK=1. The workaround restores the training performance.
      • Engineering discovered and fixed a correctness issue with the Softmax layer.
      • Engineering removed an inline profiler that was rarely used or updated. This change relates to GitHub issue 340.
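For the overflow-check workaround listed above, the environment variable can also be set from the training script itself. The following is a minimal sketch; setting the variable before importing HugeCTR is an assumption made to be safe, since these notes do not say when the variable is read.

```python
import os

# Only disable the check when the input keys are known to fit the workspace.
os.environ["HUGECTR_DISABLE_OVERFLOW_CHECK"] = "1"

# import hugectr  # left commented so the sketch runs without HugeCTR installed
print(os.environ["HUGECTR_DISABLE_OVERFLOW_CHECK"])
```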
    • Known Issues:

      • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

          --shm-size=1g --ulimit memlock=-1
        

        See also the NCCL known issue and the GitHub issue.

      • KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.

      • The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

      • Joint loss training with a regularizer is not supported.

    Source code(tar.gz)
    Source code(zip)
  • v3.8(Jul 14, 2022)

    What's New in Version 3.8

    • Sample Notebook to Demonstrate 3G Embedding: This release includes a sample notebook that introduces the Python API of the embedding collection and the key concepts for using 3G embedding. You can view HugeCTR Embedding Collection from the documentation or access the embedding_collection.ipynb file from the notebooks directory of the repository.

    • DLPack Python API for Hierarchical Parameter Server Lookup: This release introduces support for embedding lookup from the Hierarchical Parameter Server (HPS) using the DLPack Python API. The new method is lookup_fromdlpack(). For sample usage, see the Lookup the Embedding Vector from DLPack heading in the "Hierarchical Parameter Server Demo" notebook.

    • Read Parquet Datasets from HDFS with the Python API: This release enhances the DataReaderParams class with a data_source_params argument. You can use the argument to specify the data source configuration such as the host name of the Hadoop NameNode and the NameNode port number to read from HDFS.

    • Logging Performance Improvements: This release includes a performance enhancement that reduces the performance impact of logging.

    • Enhancements to Layer Classes:

      • The FullyConnected layer now supports 3D inputs.
      • The MatrixMultiply layer now supports 4D inputs.
    • Documentation Enhancements:

    • Issues Fixed:

      • The data generator for the Parquet file type is fixed and produces consistent file names between the _metadata.json file and the actual dataset files. Previously, running the data generator to create synthetic data resulted in a core dump. This issue was first reported in the GitHub issue 321.
      • Fixed the memory crash in running a large model on multiple GPUs that occurred during AUC warm up.
      • Fixed the issue of keyset generation in the ETC notebook. Refer to the GitHub issue 332 for more details.
      • Fixed the inference build error that occurred when building with debug mode.
      • Fixed the issue that multi-node training prints duplicate messages.
    • Known Issues:

      • Hybrid embedding with IB_NVLINK as the communication_type of the HybridEmbeddingParam class does not work currently. We are working on fixing it. The other communication types have no issues.

      • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

          --shm-size=1g --ulimit memlock=-1
        

        See also the NCCL known issue and the GitHub issue.

      • KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.

      • The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

      • Joint loss training with a regularizer is not supported.

    Source code(tar.gz)
    Source code(zip)
  • v3.7(Jun 16, 2022)

    What's New in Version 3.7

    • 3G Embedding Developer Preview: Version 3.7 introduces the next generation of embedding as a developer preview feature. We call it 3G embedding because it is the third generation of the HugeCTR embedding interface and implementation, following the unified embedding introduced in v3.1, which was the second. Compared with the previous embedding, there are three main changes in the embedding collection.

      • First, it allows users to fuse embedding tables with different embedding vector sizes. The previous embedding can only fuse embedding tables with the same embedding vector size. The enhancement boosts both flexibility and performance.
      • Second, it extends the functionality of embedding by supporting the concat combiner and supporting different slot lookup on the same embedding table.
      • Finally, the embedding collection is powerful enough to support arbitrary embedding table placement which includes data parallel and model parallel. By providing a plan JSON file, you can configure the table placement strategy as you specify. See the dlrm_train.py file in the embedding_collection_test directory of the repository for a more detailed usage example.
    • HPS Performance Improvements:

      • Kafka: Model parameters are now stored in Kafka in a bandwidth-saving multiplexed data format. This data format vastly increases throughput. In our lab, we measured transfer speeds up to 1.1 Gbps for each Kafka broker.
      • HashMap backend: Parallel and single-threaded hashmap implementations have been replaced by a new unified implementation. This new implementation uses a new memory-pool based allocation method that vastly increases upsert performance without diminishing recall performance. Compared with the previous implementation, you can expect a 4x speed improvement for large-batch insertion operations.
      • Suppressed and simplified log: Most log messages related to HPS have the log level changed to TRACE rather than INFO or DEBUG to reduce logging verbosity.
    • Offline Inference Usability Enhancements:

      • The thread pool size is configurable in the Python interface, which is useful for studying the embedding cache performance in scenarios of asynchronous update. Previously it was set as the minimum value of 16 and std::thread::hardware_concurrency(). For more information, please refer to Hierarchical Parameter Server Configuration.
    • DataGenerator Performance Improvements: You can specify the num_threads parameter to parallelize a Norm dataset generation.

    • Evaluation Metric Improvements:

      • Average loss performance improvement in multi-node environments.
      • AUC performance optimization and safer memory management.
      • Addition of NDCG and SMAPE.
    • Embedding Training Cache Parquet Demo: Created a keyset extractor script to generate keyset files for Parquet datasets. Provided users with an end-to-end demo of how to train a Parquet dataset using the embedding cache mode. See the Embedding Training Cache Example notebook.

    • Documentation Enhancements: The documentation details for HugeCTR Hierarchical Parameter Server Database Backend are updated for consistency and clarity.

    • Issues Fixed:

      • If slot_size_array is specified, workspace_size_per_gpu_in_mb is no longer required.
      • If you build and install HugeCTR from scratch, you can specify the CMAKE_INSTALL_PREFIX CMake variable to identify the installation directory for HugeCTR.
      • Fixed SOK hang issue when calling sok.Init() with a large number of GPUs. See the GitHub issue 261 and 302 for more details.
    • Known Issues:

      • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

          --shm-size=1g --ulimit memlock=-1
        

        See also the NCCL known issue and the GitHub issue.

      • KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.

      • The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

      • Joint loss training with a regularizer is not supported.

      • The Criteo 1 TB click logs dataset that is used with many HugeCTR sample programs and notebooks is currently unavailable. Until the dataset becomes downloadable again, you can run those samples based on our synthetic dataset generator. For more information, see the Getting Started section of the repository README file.

      • The data generator for the Parquet type produces file names in _metadata.json that are inconsistent with the actual dataset files, which results in a core dump when you use the synthetic dataset.

    Source code(tar.gz)
    Source code(zip)
  • v3.6(May 11, 2022)

    What's New in Version 3.6

    • Concat 3D Layer: In previous releases, the Concat layer could handle two-dimensional (2D) input tensors only. Now, the input can be three-dimensional (3D) and you can concatenate the inputs along axis 1 or 2. For more information, see the API documentation for the Concat Layer.
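To make the axis choice concrete, the following sketch computes the output shape of concatenating 3D tensors along axis 1 or 2. It is plain Python for illustration and does not call the HugeCTR Concat layer; the function name and the example shapes are hypothetical.

```python
def concat_shape(shapes, axis):
    """Shape produced by concatenating 3D tensors along axis 1 or 2."""
    if axis not in (1, 2):
        raise ValueError("axis must be 1 or 2")
    first = shapes[0]
    for shape in shapes:
        if len(shape) != 3:
            raise ValueError("expected 3D shapes")
        # Every dimension except the concatenation axis must match.
        for dim in range(3):
            if dim != axis and shape[dim] != first[dim]:
                raise ValueError(f"dimension {dim} differs: {shape} vs {first}")
    out = list(first)
    out[axis] = sum(shape[axis] for shape in shapes)
    return tuple(out)


along_1 = concat_shape([(64, 10, 128), (64, 6, 128)], axis=1)
along_2 = concat_shape([(64, 10, 128), (64, 10, 32)], axis=2)
print(along_1, along_2)
```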

    • Dense Column List Support in Parquet DataReader: In previous releases, HugeCTR assumes each dense feature has a single value and it must be the scalar data type float32. Now, you can mix float32 or list[float32] for dense columns. This enhancement means that each dense feature can have more than one value. For more information, see the API documentation for the Parquet dataset format.

    • Support for HDFS is Re-enabled in Merlin Containers: Support for HDFS in Merlin containers is an optional dependency now. For more information, see HDFS Support.

    • Evaluation Metric Enhancements: In previous releases, HugeCTR computes AUC for binary classification only. Now, HugeCTR supports AUC for multi-label classification. The implementation is inspired by sklearn.metrics.roc_auc_score and performs the unweighted macro-averaging strategy that is the default for scikit-learn. You can specify a value for the label_dim parameter of the input layer to enable multi-label classification and HugeCTR will compute the multi-label AUC.
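The unweighted macro-averaging strategy mentioned above can be sketched in a few lines: compute a binary AUC for each label column, then take the plain mean. This is an illustrative pure-Python version with hypothetical function names and toy data, not the HugeCTR GPU implementation; the pairwise AUC shown here is quadratic in the number of samples.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(label_rows, score_rows):
    """Return (per-label AUCs, unweighted mean) for label_dim label columns."""
    label_dim = len(label_rows[0])
    per_label = [
        binary_auc([row[j] for row in label_rows], [row[j] for row in score_rows])
        for j in range(label_dim)
    ]
    return per_label, sum(per_label) / label_dim


# Toy data with label_dim = 2: one row per sample, one column per label.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.3, 0.4], [0.8, 0.7], [0.1, 0.6]]
per_label, macro = macro_auc(labels, scores)
print(per_label, macro)
```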

    • Log Output Format Change: The default log format now includes milliseconds.

    • Documentation Enhancements:

      • These release notes are included in the documentation and are available at https://nvidia-merlin.github.io/HugeCTR/v3.6/release_notes.html.
      • The Configuration section of the Hierarchical Parameter Server information is updated with more information about the parameters in the configuration file.
      • The example notebooks that demonstrate how to work with multi-modal data are reorganized in the navigation. The notebooks are now available under the heading Multi-Modal Example Notebooks. This change is intended to make it easier to find the notebooks.
      • The documentation in the sparse_operation_kit directory of the repository on GitHub is updated with several clarifications about SOK.
    • Issues Fixed:

      • The dlrm_kaggle_fp32.py file in the samples/dlrm/ directory of the repository is updated to show the correct number of samples. The num_samples value is now set to 36672493. This fixes GitHub issue 301.
      • Hierarchical Parameter Server (HPS) would produce a runtime error when the GPU cache was turned off. This issue is now fixed.
    • Known Issues:

      • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

          --shm-size=1g --ulimit memlock=-1
        

        See also the NCCL known issue and the GitHub issue.

      • KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.

      • The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

      • Joint loss training with a regularizer is not supported.

      • The Criteo 1 TB click logs dataset that is used with many HugeCTR sample programs and notebooks is currently unavailable. Until the dataset becomes downloadable again, you can run those samples based on our synthetic dataset generator. For more information, see the Getting Started section of the repository README file.

    Source code(tar.gz)
    Source code(zip)
  • v3.5(Apr 1, 2022)

    What's New in Version 3.5

    • HPS interface encapsulation and exporting as library: We encapsulated the Hierarchical Parameter Server (HPS) interfaces and deliver them as a standalone library. We also provide HPS Python APIs and demonstrate their usage with a notebook. For more information, please refer to Hierarchical Parameter Server and HPS Demo.

    • Hierarchical Parameter Server Triton Backend: The HPS Backend is a framework for looking up embedding vectors in large-scale embedding tables. It is designed to use GPU memory effectively and accelerates lookups by decoupling the embedding tables and embedding cache from the end-to-end inference pipeline of the deep recommendation model. For more information, please refer to Hierarchical Parameter Server.

    • SOK pip release: SOK is now released on PyPI at https://pypi.org/project/merlin-sok/. Users can install SOK with pip install merlin-sok.

    • Joint loss and multi-task training support: We support joint loss in training so that users can train with multiple labels and tasks with different weights. An MMoE sample is added to show the usage.

    • HugeCTR documentation on web page: Now users can visit our web documentation.

    • ONNX converter enhancement: We enable converting MultiCrossEntropyLoss and CrossEntropyLoss layers to ONNX to support multi-label inference. For more information, please refer to HugeCTR to ONNX Converter.

    • HDFS python API enhancement:

      • Simplified DataSourceParams so that users do not need to provide all the paths before they are really necessary. Now users only have to pass DataSourceParams once when creating a solver.
      • Later paths will be automatically regarded as local paths or HDFS paths depending on the DataSourceParams setting. See notebook for usage.
    • HPS performance optimization: We use a better method to determine the number of partitions in the HPS database backends.

    • Bug fixing:

      • The HugeCTR input layer can now take a dense_dim greater than 1000.
    Source code(tar.gz)
    Source code(zip)
  • v3.4.1(Mar 1, 2022)

    What's New in Version 3.4.1

    • Support mixed precision inference for dataset with multiple labels: We enable FP16 for the Softmax layer and support mixed precision for multi-label inference. For more information, please refer to Inference API.

    • Support multi-GPU offline inference with Python API: We support multi-GPU offline inference with the Python interface, which can leverage Hierarchical Parameter Server and enable concurrent execution on multiple devices. For more information, please refer to Inference API and Multi-GPU Offline Inference Notebook.

    • Introduction to _metadata.json: We added an introduction to _metadata.json for Parquet datasets. For more information, please refer to Parquet.

    • Documents and tool for workspace size per GPU estimation: We added a tool named embedding_workspace_calculator to help calculate the workspace_size_per_gpu_in_mb value required by hugectr.SparseEmbedding. For more information, please refer to embedding_workspace_calculator/README.md and QA 24.

    • Improved Debugging Capability: The old logging system, which was flagged as deprecated for some time, has been removed. All remaining log messages and outputs have been revised and migrated to the new logging system (base/debug/logging.hpp/cpp). During this revision, we also adjusted log levels for log messages throughout the entire codebase to improve visibility of relevant information.

    • Support HDFS Parameter Server in Training:

      • Decoupled HDFS in Merlin containers to make the HDFS support more flexible. Users can now compile HDFS related functionalities optionally.
      • Now supports loading and dumping models and optimizer states from HDFS.
      • Added a notebook to show how to use HugeCTR with HDFS.
    • Support Multi-hot Inference on Hugectr Backend: We support categorical input in multi-hot format for HugeCTR Backend inference.

    • Multi-label inference with mixed precision: Mixed precision training is enabled for the Softmax layer.

    • Python Script and documentation demonstrating how to analyze model files: In this release, we provide a script to retrieve vocabulary information from a model file. For more details, see the README.

    • Bug Fixing:

      • Mirror strategy bug in SOK (see in https://github.com/NVIDIA-Merlin/HugeCTR/issues/291)
      • Can't import sparse operation kit in nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.03 (see in https://github.com/NVIDIA-Merlin/HugeCTR/issues/296)
      • HPS: Fixed access violation that can occur during initialization when not configuring a volatile DB.

    Known Issues

    • HugeCTR uses NCCL to share data between ranks, and NCCL may require shared system memory for IPC and pinned (page-locked) system memory resources. When using NCCL inside a container, it is recommended that you increase these resources by issuing: --shm-size=1g --ulimit memlock=-1. See also NCCL's known issue and the GitHub issue.

    • KafkaProducers startup will succeed even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are up, working properly, and reachable from the node where you run HugeCTR.

    • The number of data files in the file list should be no less than the number of data reader workers. Otherwise, different workers will be mapped to the same file and data loading does not progress as expected.

    Source code(tar.gz)
    Source code(zip)
  • v3.4(Jan 28, 2022)

    What's New in Version 3.4

    • Supporting HugeCTR Development with Merlin Unified Container: Starting with Merlin v22.02, we encourage you to develop HugeCTR in the Merlin Unified Container (release container), following the instructions in the Contributor Guide, for consistency.

    • Hierarchical Parameter Server (HPS) Enhancements:

      • Missing key insertion feature: Via a simple flag, it is now possible to configure HugeCTR such that missed embedding-table entries during lookup are automatically inserted into volatile database layers such as the Redis and Hashmap backends.
      • Asynchronous timestamp refresh: In the last release we introduced the passing-of-time-aware eviction policies. These are policies that are applied to shrink database partitions through dropping keys if they grow beyond certain limits. However, the time-information utilized by these eviction policies represented the update time. Hence, an embedding was evicted based on the time passed since its last update. If you operate HugeCTR in inference mode, the embedding table is typically immutable. With the above-described missing key insertion feature we now support actively tuning the contents of volatile database layers to the data distribution during lookup. To allow time-based eviction to take place, it is now possible to enable timestamp refreshing for frequently used embeddings. Once enabled, refreshing is handled asynchronously using background threads. Hence, it won’t block your inference jobs. For most applications, the associated performance impact from enabling this feature is barely noticeable.
      • Support HDFS(Hadoop Distributed File System) Parameter Server in Training:
        • A new Python API, DataSourceParams, is used to specify the file system and the paths to data and model files.
        • Support loading data from HDFS to the local file system for HugeCTR training.
        • Support dumping trained model and optimizer states into HDFS.
      • Online seamless update of the parameters of the dense part of the model: HugeCTR Backend now supports online model version updates through the Load API of Triton, including the seamless update of the dense part and the corresponding embedding inference cache for the same model. The Load API remains fully compatible with online deployment of new models.
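The interaction between time-based eviction and timestamp refresh, described in the asynchronous timestamp refresh item above, can be modeled with a toy cache. This is a sketch only: `VolatileCache` and its parameters are hypothetical, eviction simply drops the entry with the oldest timestamp, and the refresh happens synchronously here, whereas HugeCTR performs it asynchronously in background threads.

```python
class VolatileCache:
    """Toy key-value store that evicts the entry with the oldest timestamp."""

    def __init__(self, capacity, refresh_on_lookup, clock):
        self.capacity = capacity
        self.refresh_on_lookup = refresh_on_lookup
        self.clock = clock
        self.entries = {}  # key -> [value, timestamp]

    def upsert(self, key, value):
        self.entries[key] = [value, self.clock()]
        self._evict()

    def lookup(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        if self.refresh_on_lookup:
            entry[1] = self.clock()  # a lookup keeps a frequently used key young
        return entry[0]

    def _evict(self):
        while len(self.entries) > self.capacity:
            oldest = min(self.entries, key=lambda k: self.entries[k][1])
            del self.entries[oldest]


def run(refresh):
    ticks = iter(range(100))  # fake clock for a deterministic example
    cache = VolatileCache(capacity=2, refresh_on_lookup=refresh,
                          clock=lambda: next(ticks))
    cache.upsert("a", 1.0)
    cache.upsert("b", 2.0)
    cache.lookup("a")        # refreshes the timestamp of "a" only when enabled
    cache.upsert("c", 3.0)   # exceeds capacity and triggers eviction
    return sorted(cache.entries)


print(run(refresh=False), run(refresh=True))
```

Without refresh, the key that was inserted first is evicted even though it was just looked up. With refresh enabled, the recently used key survives and the idle one is dropped.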
    • Sparse Operation Kit Enhancements:

      • Mixed Precision Training: Mixed precision training can be enabled via TF’s pattern to enhance the training performance and lessen memory usage.
      • DLRM Benchmark: DLRM is a standard benchmark for recommendation model training. A notebook is added in this release to address the performance of SOK on this benchmark.
      • Support uint32_t / int64_t key dtype in SOK: int64 or uint32 can be used as the key data type for SOK’s embedding. By default, it is int64.
      • Add TensorFlow initializers support: TensorFlow native initializers can now be used in SOK, for example, sok.All2AllDenseEmbedding(embedding_initializer=tf.keras.initializers.RandomUniform()).
    • User Experience Enhancements

      • We have revised several notebooks and readme files to clarify instructions and make HugeCTR more accessible in general.
      • Thanks to GitHub user @MuYu-zhi, who brought to our attention that configuring too little shared memory can impact the proper operation of HugeCTR. We extended the SOK docker setup instructions to address how such issues can be resolved using the --shm-size setting of docker.
      • Although HugeCTR is designed for scalability, having a beefy machine is not necessary for smaller workloads and testing. We added information about the required specs for notebook testing environments in README.
    • Inference for Multi-tasking: We support HugeCTR inference for multiple tasks. When the label dimension is the number of binary classification tasks and MultiCrossEntropyLoss is employed during training, the shape of inference results will be (batch_size*num_batches, label_dim). For more information, please refer to Inference API.

    • Fix the Embedding Cache Issue for Super Small Embedding Tables

  • v3.3.1(Jan 11, 2022)

    What's New in Version 3.3.1

    • Hierarchical Parameter Server Enhancements:
      • Online deployment of new models and recycling of old models: In this release, HugeCTR Backend is fully compatible with the model control protocol of Triton. After you add the configuration of a new model to the HPS configuration file, the HugeCTR Backend can deploy the new model online through the Load API of Triton. Old models can also be recycled online through the Unload API.
      • Simplified database backend: Multi-node, single-node and all other kinds of volatile database backends can now be configured using the same configuration object.
      • Multi-threaded optimization of Redis code: ~2.3x speedup over HugeCTR v3.3.
      • Fixes to some issues: Built an HPS test environment and implemented unit tests for each component; fixed an access violation issue in online Kafka updates; fixed the Parquet data reader incorrectly parsing the index of categorical features in the case of multiple embedding tables; fixed the HPS Redis Backend overflow handling not being invoked upon single insertions.
    • New group of fused fully connected layers: We support adding a group of fused fully connected layers when constructing the model graph. A concise Python interface is provided for users to adjust the number of layers, as well as to specify the output dimensions in each layer, which makes it easy to leverage the highly-optimized fused fully connected layer in HugeCTR. For more information, please refer to GroupDenseLayer.
    • Fixes to some issues:
      • A warning is added for the case where users forget to import mpi before launching a multi-process job.
      • Removed the massive log output when running with the embedding training cache.
      • Removed legacy conda-related information from the documents.
  • v3.3(Dec 7, 2021)

    What's New in Version 3.3

    • Hierarchical Parameter Server:

      • Support Incremental Models Updating From Online Training: HPS now supports iterative model updating via Kafka message queues. It is now possible to connect HugeCTR with Apache Kafka deployments to update the model in-place in real-time. This feature is supported in both phases, training and inference. Please refer to the Demo Notebook.
      • Support Embedding keys Eviction Mechanism: In-memory databases such as Redis or CPU memory backed storage are now used for feature memory management. Hence, when performing iterative updating, they automatically evict infrequently used embeddings as training progresses.
      • Support Embedding Cache Asynchronous Refresh Mechanism: We now support the asynchronous refreshing of incremental embedding keys into the embedding cache. A refresh operation is triggered when a model version iteration completes or when incremental parameters are output from online training. The Distributed Database and Persistent Database are updated by the distributed event streaming platform (Kafka). The GPU embedding cache then refreshes the values of the existing embedding keys and replaces them with the latest incremental embedding vectors. Please refer to the HPS README.
      • Other Improvements: Backend implementations for databases are now fully configurable. The JSON interface parser copes better with inaccurate parameterization. Log output is less noisy and (hopefully) more meaningful: based on your requests, we revised the log levels throughout the entire database backend API of the parameter server. Selected configuration options are now printed completely and uniformly to the log. Errors provide more verbose information on the matter at hand. Improved performance of the Redis cluster backend. Improved performance of the CPU memory database backend.
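
The eviction idea in the list above (drop infrequently used embeddings once storage is full) can be sketched as follows. This is a generic least-frequently-used illustration, not the policy that Redis or the HPS CPU memory backend actually applies. `BoundedEmbeddingStore` and its methods are hypothetical names.

```python
class BoundedEmbeddingStore:
    """Toy key -> vector store that evicts the least frequently read key."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.vectors = {}  # key -> embedding vector
        self.hits = {}     # key -> number of successful lookups

    def lookup(self, key):
        """Return the vector for key, or None on a miss."""
        if key in self.vectors:
            self.hits[key] += 1
            return self.vectors[key]
        return None

    def upsert(self, key, vector):
        """Insert or update a vector, evicting one entry if the store is full."""
        if key not in self.vectors and len(self.vectors) >= self.capacity:
            victim = min(self.hits, key=self.hits.get)
            del self.vectors[victim]
            del self.hits[victim]
        self.vectors[key] = vector
        self.hits.setdefault(key, 0)


store = BoundedEmbeddingStore(capacity=2)
store.upsert(1, [0.1, 0.2])
store.upsert(2, [0.3, 0.4])
store.lookup(1)
store.lookup(1)
store.lookup(2)
store.upsert(3, [0.5, 0.6])  # key 2 has the fewest lookups and is evicted
```

A production backend would also have to handle concurrency and ties between equally cold keys, which this sketch ignores.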
    • SOK TF 1.15 Support: In this version, SOK can be used along with TensorFlow 1.15. See README. A dedicated CUDA stream is used for SOK’s Ops, so kernel interleaving might be eliminated. Users can now install SOK via pip install SparseOperationKit, which no longer requires root access to compile SOK or copying Python scripts. There was a hanging issue in tf.distribute.MirroredStrategy when the TensorFlow version was greater than 2.4. In this version, this issue is fixed for TensorFlow 2.5+.

    • MLPerf v1.1 integration

      • Hybrid-embedding indices pre-computing: The indices needed for hybrid embedding are pre-computed ahead of time and are overlapped with previous iterations.
      • Cached evaluation indices: The hybrid-embedding indices for eval are cached when applicable, hence eliminating the re-computing of the indices at every eval iteration.
      • MLP weight/data gradients calculation overlap: The weight gradients of MLP are calculated asynchronously with respect to the data gradients, enabling overlap between these two computations.
      • Better compute-communication overlap: Better overlap between compute and communication has been enabled to improve training throughput.
      • Fused weight conversion: The FP32-to-FP16 conversion of the weights is now fused into the SGD optimizer, saving trips to memory.
      • GraphScheduler: GraphScheduler was added to control the timing of cudaGraph launching. With GraphScheduler, the gap between adjacent cudaGraphs is eliminated.
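
The fused weight conversion item above can be pictured with a pure-Python sketch: the FP32 master weight is updated and its FP16 copy is produced in the same pass, so the weights are not read again for a separate conversion step. This only illustrates the idea. The real implementation is a CUDA kernel, and `fused_sgd_step` and `to_fp16` are hypothetical names.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fused_sgd_step(master_weights, grads, lr):
    """Apply SGD to the master weights in place and return their FP16 copy.

    The update and the conversion share one loop, which mirrors a single
    fused pass over memory.
    """
    fp16_copy = []
    for i, (weight, grad) in enumerate(zip(master_weights, grads)):
        updated = weight - lr * grad
        master_weights[i] = updated          # full-precision master copy
        fp16_copy.append(to_fp16(updated))   # half-precision copy for compute
    return fp16_copy


master = [1.0, 0.5]
half = fused_sgd_step(master, grads=[0.5, -1.0], lr=0.5)
```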
    • Multi-node training support on the cluster without RDMA: We support multi-node training without RDMA now. You can specify allreduce algorithm as AllReduceAlgo.NCCL and it can support non-RDMA hardware. For more information, please refer to all_reduce_algo in CreateSolver API.

    • SOK support device setting with tf.config: tf.config.set_visible_devices can be used to set the visible GPUs for each process. CUDA_VISIBLE_DEVICES can also be used to achieve the same purpose. When tf.distribute.Strategy is used, the device argument must not be set.

    • User defined name is supported in model dumping: We support specifying the model name with the training API CreateSolver, which will be dumped to the JSON configuration file with the API Model.graph_to_json. This feature will facilitate the Triton deployment of saved HugeCTR models, and help to distinguish between models when Kafka sends parameters from the training side to the inference side.

    • Fine-grained control of the embedding layers: We support the fine-grained control of the embedding layers. Users can freeze or unfreeze the weights of a specific embedding layer with the APIs Model.freeze_embedding and Model.unfreeze_embedding. Besides, the weights of multiple embedding layers can be loaded independently, which enables the use case of loading pre-trained embeddings for a particular layer. For more information, please refer to Model API and Section 3.4 of HugeCTR Criteo Notebook.

  • v3.3_alpha(Dec 6, 2021)

  • v3.2.1(Nov 2, 2021)

    What's New in Version 3.2.1

    • Performance optimization on GPU embedding cache: We have optimized the performance of the GPU embedding cache stand-alone module. Now the performance has been significantly improved under small to medium batch sizes. For large batch sizes, the performance remains unchanged. This feature does not introduce any mandatory changes to the interface of the GPU embedding cache, so any existing code that uses this module does not need to change. For more information, please refer to the document of the GPU embedding cache under the gpu_cache folder.

    • Host memory cache for HugeCTR embedding training cache: We have introduced the host memory cache (HMEM-Cache) based PS for the incremental training, which is a component of the Embedding Training Cache (MOS) and responsible for handling the case when the embedding table is too large to even fit into the host memory. We have provided the SSD-based PS for this scenario in former releases, but the SSD-based PS will be deprecated from the v3.3 release due to its unsatisfactory performance. Please check the Host Memory Cache in MOS for a detailed introduction. Compared with the former SSD-based PS, the loading and dumping bandwidth of the HMEM-Cache based PS can be substantially improved if it is properly configured, which contributes to the incremental training of models with huge embedding tables when using the MOS feature in HugeCTR. To ease the utilization of the MOS feature, we have also simplified the python interface of MOS. Specifically, we drop the use_host_memory_ps entry by providing the ps_types entry for choosing the HMEM-based PS or the HMEM-Cache based PS; A unified entry sparse_models is introduced, and you don’t need to use different entries to tell whether a pre-trained embedding table exists or not. For a detailed explanation of the python interface, please check the HugeCTR Python Interface.

    • Debugging Capability Improvement: We have introduced a set of new debugging capability features, which include multi-level logging and more informative throw and check. We also provide a set of kernel debugging functions. Based on these features, we are actively working on making the information and error messages from HugeCTR cleaner, so that our users are well informed about what is happening with their training & inference code at a desired level. Stay tuned! For more detailed information, check out the comments in the header files located at HugeCTR/include/base/debug.

    • Embedding cache asynchronous insertion mechanism: We now support the asynchronous insertion of missing embedding keys into the embedding cache. This feature is activated automatically based on a user-defined hit rate threshold in the configuration file. When the real hit rate of the embedding cache is higher than the user-defined threshold, the embedding cache inserts missing keys asynchronously; otherwise, it inserts them synchronously to ensure high accuracy of inference requests. Compared with the previous synchronous method, asynchronous insertion can further improve the real hit rate of the embedding cache after it reaches the user-defined threshold.
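
The threshold rule above amounts to a simple decision per lookup batch. The sketch below illustrates only that rule. `choose_insertion_mode` and its arguments are hypothetical names, not HugeCTR configuration keys.

```python
def choose_insertion_mode(hits, lookups, hit_rate_threshold):
    """Pick how missing keys are inserted, following the rule described above.

    Returns "async" when the measured hit rate is above the threshold,
    and "sync" otherwise (including when nothing has been looked up yet).
    """
    hit_rate = hits / lookups if lookups else 0.0
    return "async" if hit_rate > hit_rate_threshold else "sync"


# A warm cache (90% hits) can defer insertions. A cold one (50%) cannot.
warm = choose_insertion_mode(hits=90, lookups=100, hit_rate_threshold=0.8)
cold = choose_insertion_mode(hits=50, lookups=100, hit_rate_threshold=0.8)
```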

    • Performance optimization of Parameter Server: We have added support for multiple database interfaces to our parameter server. In particular, we added an “in memory” database that uses the local CPU memory for storing and recalling embeddings and uses multi-threading to accelerate look-up and storage. Further, we revised support for “distributed” storage of embeddings in a Redis cluster. This way, you can use the combined CPU-accessible memory of your cluster for storing embeddings. The new implementation is over two orders of magnitude faster than the previous one. Further, we performance-optimized support for the “persistent” storage and retrieval of embeddings via RocksDB through the structured use of column families. Creating a hierarchical storage (i.e. using Redis as distributed cache, and RocksDB as fallback) is supported as well. These advantages are free to end-users, as there is no need to adjust the PS configuration. We plan to further integrate the hierarchical parameter server with other features, such as the GPU backed embedding caches, in upcoming releases. Stay tuned!

    • Graph Analysis to internalize the Slice layer: The branch topology is inherently supported by the HugeCTR model graph, which requires users to explicitly insert a Slice layer with Python APIs to enable it. In order to simplify the usage, the Slice layer for the branch topology can be abstracted away in the Python interface. The graph analysis will be conducted to resolve the tensor dependency and the Slice layer will be internally inserted if the same tensor is consumed more than once to form the branch topology. The previous usage of explicitly adding the Slice layer is still supported, while using this new feature to internalize it is strongly recommended. Please refer to Getting Started to see how to construct a model graph with branches without the Slice layer. You can refer to Slice Layer for more details.
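
The graph analysis described above can be sketched in plain Python: count how many layers consume each tensor, and insert a Slice layer in front of the first consumer whenever a tensor is consumed more than once. The dictionary layout (`type`, `bottoms`, `tops`) and the `_slice{i}` naming are assumptions made for this illustration, not HugeCTR's internal representation.

```python
from collections import defaultdict


def insert_slice_layers(layers):
    """Return a new layer list with a Slice layer for every shared tensor."""
    # Record every (layer index, input position) that reads each tensor.
    consumers = defaultdict(list)
    for idx, layer in enumerate(layers):
        for pos, name in enumerate(layer["bottoms"]):
            consumers[name].append((idx, pos))

    # Work on copies so the caller's graph is left untouched.
    rewired = [
        dict(layer, bottoms=list(layer["bottoms"]), tops=list(layer["tops"]))
        for layer in layers
    ]
    slices_before = defaultdict(list)
    for name, uses in consumers.items():
        if len(uses) < 2:
            continue
        branches = [f"{name}_slice{i}" for i in range(len(uses))]
        for (idx, pos), branch in zip(uses, branches):
            rewired[idx]["bottoms"][pos] = branch
        first_consumer = uses[0][0]
        slices_before[first_consumer].append(
            {"type": "Slice", "bottoms": [name], "tops": branches}
        )

    result = []
    for idx, layer in enumerate(rewired):
        result.extend(slices_before.get(idx, []))
        result.append(layer)
    return result


graph = [
    {"type": "InnerProduct", "bottoms": ["dense"], "tops": ["fc1"]},
    {"type": "ReLU", "bottoms": ["fc1"], "tops": ["relu1"]},
    {"type": "InnerProduct", "bottoms": ["relu1"], "tops": ["branch_a"]},
    {"type": "InnerProduct", "bottoms": ["relu1"], "tops": ["branch_b"]},
]
with_slices = insert_slice_layers(graph)
```

Here `relu1` feeds two layers, so one Slice layer is added and each consumer is rewired to its own branch tensor.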

  • v3.2(Sep 22, 2021)

    What's New in Version 3.2

    • New HugeCTR to ONNX Converter: We’re introducing a new HugeCTR to ONNX converter in the form of a Python package. All graph configuration files are required and model weights must be formatted as inputs. You can specify where you want to save the converted ONNX model. You can also convert sparse embedding models. For more information, refer to HugeCTR to ONNX Converter and HugeCTR2ONNX Demo Notebook.

    • New Hierarchical Storage Mechanism on the Parameter Server (POC): We’ve implemented a hierarchical storage mechanism between local SSDs and CPU memory. As a result, embedding tables no longer have to be stored in the local CPU memory. The distributed Redis cluster is being implemented as a CPU cache to store larger embedding tables and interact with the GPU embedding cache directly. The local RocksDB serves as a query engine to back up the complete embedding table on the local SSDs and assist the Redis cluster with looking up missing embedding keys. Please find more information here.

    • Parquet Format Support Within the Data Generator: The HugeCTR data generator now supports the parquet format, which can be configured easily using the Python API. For more information, refer to Data Generator API.

    • Python Interface Support for the Data Generator: The data generator has been enabled within the HugeCTR Python interface. The parameters associated with the data generator have been encapsulated into the DataGeneratorParams struct, which is required to initialize the DataGenerator instance. You can use the data generator's Python APIs to easily generate the Norm, Parquet, or Raw dataset formats with the desired distribution of sparse keys. For more information, refer to Data Generator API and Data Generator Samples.

    • Improvements to the Formula of the Power Law Simulator within the Data Generator: We've modified the formula of the power law simulator within the data generator so that a positive alpha value is always produced, which will be needed for most use cases. The alpha values for Long, Medium, and Short within the power law distribution are 0.9, 1.1, and 1.3 respectively. For more information, refer to Data Generator API.

    • Support for Arbitrary Input and Output Tensors in the Concat and Slice Layers: The Concat and Slice layers now support any number of input and output tensors. Previously, these layers were limited to a maximum of four tensors.

    • New Continuous Training Notebook: We’ve added a new notebook to demonstrate how to perform continuous training using the model oversubscription (also referred to as Embedding Training Cache) feature. For more information, refer to HugeCTR Continuous Training.

    • New HugeCTR Contributor Guide: We've added a new HugeCTR Contributor Guide that explains how to contribute to HugeCTR, which may involve reporting and fixing a bug, introducing a new feature, or implementing a new or pending feature.

    • Enhancements to Sparse Operation Kits (SOK): SOK now supports TensorFlow 2.5 and 2.6. We also added support for identity hashing, dynamic input, and Horovod within SOK. Lastly, we added a new SOK docs set to help you get started with SOK.

  • v3.1(Aug 3, 2021)

    What's New in Version 3.1

    • MLPerf v1.0 Integration: We've integrated MLPerf optimizations for DLRM training and enabled them as configurable options in Python interface. Specifically, we have incorporated AsyncRaw data reader, HybridEmbedding, FusedReluBiasFullyConnectedLayer, overlapped pipeline, holistic CUDA Graph and so on. The performance of 14-node DGX-A100 DLRM training with Python APIs is comparable to CLI usage. For more information, see HugeCTR Python Interface and DLRM Sample.

    • Enhancements to the Python Interface: We’ve enhanced the Python interface for HugeCTR so that you no longer have to manually create a JSON configuration file. Our Python APIs can now be used to create the computation graph. They can also be used to dump the model graph as a JSON object and save the model weights as binary files so that continuous training and inference can take place. We've added an Inference API that takes Norm or Parquet datasets as input to facilitate the inference process. For more information, see HugeCTR Python Interface and HugeCTR Criteo Notebook.

    • New Interface for Unified Embedding: We’re introducing a new interface to simplify the use of embeddings and datareaders. To help you specify the number of keys in each slot, we added nnz_per_slot and is_fixed_length. You can now directly configure how much memory usage you need by specifying workspace_size_per_gpu_in_mb instead of max_vocabulary_size_per_gpu. For convenience, mean/sum is used in combinators instead of 0 and 1. In cases where you don't know which embedding type you should use, you can specify use_hash_table and let HugeCTR automatically select the embedding type based on your configuration. For more information, see HugeCTR Python Interface.
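
To make the mean/sum combiner setting concrete, the sketch below reduces the embedding vectors looked up for one slot into a single vector. It is a plain-Python illustration of the two reductions, not HugeCTR code, and `combine_slot` is a hypothetical name.

```python
def combine_slot(vectors, combiner):
    """Reduce the embedding vectors of one slot with a "sum" or "mean" combiner."""
    if not vectors:
        raise ValueError("a slot needs at least one embedding vector")
    dim = len(vectors[0])
    summed = [sum(vec[d] for vec in vectors) for d in range(dim)]
    if combiner == "sum":
        return summed
    if combiner == "mean":
        return [value / len(vectors) for value in summed]
    raise ValueError(f"unknown combiner: {combiner!r}")


# A multi-hot slot with two keys, each mapped to a 2-dimensional vector.
slot_vectors = [[1.0, 2.0], [3.0, 4.0]]
summed = combine_slot(slot_vectors, "sum")
averaged = combine_slot(slot_vectors, "mean")
```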

    • Multi-Node Support for Embedding Training Cache (MOS): We’ve enabled multi-node support for the embedding training cache. You can now train a model with a terabyte-size embedding table using one node or multiple nodes even if the entire embedding table can't fit into the GPU memory. We're also introducing the host memory (HMEM) based parameter server (PS) along with its SSD-based counterpart. If the sparse model can fit into the host memory of each training node, the optimized HMEM-based PS can provide better model loading and dumping performance with a more effective bandwidth. For more information, see HugeCTR Python Interface.

    • Enhancements to the Multi-Nodes TensorFlow Plugin: The Multi-Nodes TensorFlow Plugin now supports multi-node synchronized training via tf.distribute.MultiWorkerMirroredStrategy. With minimal code changes, you can now easily scale your single GPU training to multi-node multi GPU training. The Multi-Nodes TensorFlow Plugin also supports multi-node synchronized training via Horovod. The inputs for embedding plugins are now data parallel, so the datareader no longer needs to preprocess data for different GPUs based on concrete embedding algorithms.

    • NCF Model Support: We've added support for the NCF model, as well as the GMF and NeuMF variant models. With this enhancement, we're introducing a new element-wise multiplication layer and HitRate evaluation metric. Sample code was added that demonstrates how to preprocess user-item interaction data and train an NCF model with it. New examples have also been added that demonstrate how to train NCF models using MovieLens datasets.

    • DIN and DIEN Model Support: All of our layers support the DIN model. The following layers support the DIEN model: FusedReshapeConcat, FusedReshapeConcatGeneral, Gather, GRU, PReLUDice, ReduceMean, Scale, Softmax, and Sub. We also added sample code to demonstrate how to use the Amazon dataset to train the DIN model.

    • Multi-Hot Support for Parquet Datasets: We've added multi-hot support for parquet datasets, so you can now train models with a parquet dataset that contains both one-hot and multi-hot slots.

    • Mixed Precision (FP16) Support in More Layers: The MultiCross layer now supports mixed precision (FP16). All layers now support FP16.

    • Mixed Precision (FP16) Support in Inference: We've added FP16 support for the inference pipeline. Therefore, dense layers can now adopt FP16 during inference.

    • Optimizer State Enhancements for Continuous Training: You can now store optimizer states that are updated during continuous training as files, such as the Adam optimizer's first moment (m) and second moment (v). By default, the optimizer states are initialized with zeros, but you can specify a set of optimizer state files to recover their previous values. For more information about dense_opt_states_file and sparse_opt_states_file, see Python Interface.
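
Why the optimizer state matters for continuous training can be seen with a scalar Adam step. The sketch below is the textbook Adam update with first moment m and second moment v, not HugeCTR's implementation. The state dictionary layout, including the step counter t, is an assumption for illustration.

```python
import math


def adam_step(weight, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar weight and mutate state in place."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return weight - lr * m_hat / (math.sqrt(v_hat) + eps)


# First training session: the state starts from zeros.
state = {"m": 0.0, "v": 0.0, "t": 0}
w = adam_step(1.0, 0.5, state)

# Second session, resumed from the stored state versus restarted from zeros.
w_resumed = adam_step(w, -0.5, dict(state))
w_cold = adam_step(w, -0.5, {"m": 0.0, "v": 0.0, "t": 0})
```

With the stored state, the earlier positive gradient damps the second update. A cold start forgets it and takes a full-size step.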

    • New Library File for GPU Embedding Cache Data: We’ve moved the header/source code of the GPU embedding cache data structure into a stand-alone folder. It has been compiled into a stand-alone library file. Similar to HugeCTR, your application programs can now be directly linked from this new library file for future use. For more information, see our GPU Embedding Cache ReadMe.

    • Embedding Plugin Enhancements: We’ve moved all the embedding plugin files into a stand-alone folder. The embedding plugin can be used as a stand-alone python module, and works with TensorFlow to accelerate the embedding training process.

    • Adagrad Support: Adagrad can now be used to optimize your embedding and network. To use it, change the optimizer type in the Optimizer layer and set the corresponding parameters.

  • v3.1_beta(May 20, 2021)

    Release Notes

    Bigger models and large-scale training are always the main requirements in recommendation systems. In v3.1, we provide the following set of new optimizations for good scalability, and they are now available in this beta version.

    • Distributed Hybrid embedding - Model/data parallel split of embeddings based on statistical access frequency to minimize embedding exchange traffic.
    • Optimized communication collectives - Hierarchical multi-node all-to-all for NVLINK aggregation and oneshot algorithm for All-reduce.
    • Optimized data reader - Async I/O based data reader to maximize I/O utilization, minimize interference with collectives and eval caching.
    • MLP fusions - Fused GEMM + Relu + Bias fprop and GEMM + dRelu + bgrad bprop.
    • Compute-communication overlap - Generalized embedding and bottom MLP overlap.
    • Holistic CUDA graph - Full iteration graph capture to reduce launch latencies and jitter.
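
The frequency-based split behind the distributed hybrid embedding item above can be sketched as follows. Keys are ranked by how often they appear in a sample of the data. The most frequent keys are treated as candidates for data-parallel (replicated) storage and the rest as candidates for model-parallel (sharded) storage. The fixed `num_frequent` cutoff is an assumption for illustration; the real selection criterion is not described in these notes.

```python
from collections import Counter


def split_by_frequency(keys, num_frequent):
    """Split the observed keys into (frequent, infrequent) sets by access count."""
    counts = Counter(keys)
    frequent = {key for key, _ in counts.most_common(num_frequent)}
    infrequent = set(counts) - frequent
    return frequent, infrequent


observed_keys = [1, 1, 1, 2, 2, 3, 4]
frequent, infrequent = split_by_frequency(observed_keys, num_frequent=2)
```

Replicating the hot keys avoids exchanging their embeddings between GPUs on every iteration, which is the traffic this optimization targets.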
  • v3.0.1(Apr 12, 2021)

    What's New in Version 3.0.1

    • DLRM Inference Benchmark: We've added two detailed Jupyter notebooks to illustrate how to train and deploy a DLRM model with HugeCTR whilst benchmarking its performance. The inference notebook demonstrates how to create Triton and HugeCTR backend configs, prepare the inference data, and deploy a model trained by the other notebook on Triton Inference Server. It also shows how to benchmark its performance (throughput and latency) with Triton Performance Analyzer. For more details, check out our HugeCTR inference repository.
    • FP16 Specific Optimization in More Dense Layers: We've optimized DotProduct, ELU, and Sigmoid layers based on __half2 vectorized loads and stores, so that they better utilize device memory bandwidth. Now most layers have been optimized in such a way except MultiCross, FmOrder2, ReduceSum, and Multiply layers.
    • More Finely Tunable Synthetic Data Generator: Our new data generator can generate uniformly distributed datasets in addition to power law based datasets. Instead of specifying vocabulary_size in total and max_nnz, you can specify such information per categorical feature. See our user guide to learn its changed usage.
    • Decreased Memory Demands of Trained Model Exportation: To prevent out-of-memory errors when saving a trained model that includes a very large embedding table, we reduced the amount of memory allocated by the related functions.
    • CUDA Graph Compatible Dropout Layer: HugeCTR Dropout Layer uses cuDNN by default, so that it can be used together with CUDA Graph. In the previous version, if Dropout was used, CUDA Graph was implicitly turned off.
  • v3.0(Mar 9, 2021)

    Release Notes

    What’s New in Version 3.0

    • Inference Support: To streamline the recommender system workflow, we’ve implemented a custom HugeCTR backend on the NVIDIA Triton Inference Server. The HugeCTR backend leverages the embedding cache and parameter server to efficiently manage embeddings of different sizes and models in a hierarchical manner. For additional information, see our inference repository.

    • New High-Level API: You can now also construct and train your models using the Python interface with our new high-level API. See our preview example code to grasp how it works.

    • FP16 Support in More Layers: All the layers except MultiCross support mixed precision mode. We’ve also optimized some of the FP16 layer implementations based on vectorized loads and stores.

    • Enhanced TensorFlow Embedding Plugin: Our embedding plugin now supports LocalizedSlotSparseEmbeddingHash mode. With this enhancement, the DNN model no longer needs to be split into two parts since it now connects with the embedding op through MirroredStrategy within the embedding layer.

    • Extended Model Oversubscription: We’ve extended the model oversubscription feature to support LocalizedSlotSparseEmbeddingHash and LocalizedSlotSparseEmbeddingHashOneHot.

    • Epoch-Based Training Enhancement: The num_epochs option in the Solver clause can now be used with the Raw dataset format.

    • Deprecation of the eval_batches Parameter: The eval_batches parameter has been deprecated and replaced with the max_eval_batches and max_eval_samples parameters. In epoch mode, these parameters control the maximum number of evaluations. An error message will appear when attempting to use the eval_batches parameter.

    • MultiplyLayer Renamed: To clarify what the MultiplyLayer does, it was renamed to WeightMultiplyLayer.

    • Optimized Initialization Time: HugeCTR’s initialization time, which includes the GEMM algorithm search and parameter initialization, was significantly reduced.

    • Sample Enhancements: Our samples now rely upon the Criteo 1TB Click Logs dataset instead of the Kaggle Display Advertising Challenge dataset. Our preprocessing scripts (Perl, Pandas, and NVTabular) have also been unified and simplified.

    • Configurable DataReader Worker: You can now specify the number of data reader workers, which run in parallel, with the num_workers parameter. Its default value is 12. However, if you are using the Parquet data reader, you can't configure the num_workers parameter since it always corresponds to the number of active GPUs.

    Known Issues

    • Since the automatic plan file generator isn't able to handle systems that contain one GPU, you must manually create a JSON plan file with the following parameters and rename it using the name listed in the HugeCTR configuration file: {"type": "all2all", "num_gpus": 1, "main_gpu": 0, "num_steps": 1, "num_chunks": 1, "plan": [[0, 0]], "chunks": [1]}.
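
One way to create the single-GPU plan file described above is to serialize the listed parameters with Python's json module. The parameter values come from the note itself. `write_plan_file` and any file name passed to it are hypothetical, and the name must match whatever your HugeCTR configuration file lists.

```python
import json

# Parameters for a single-GPU system, as listed in the known issue above.
SINGLE_GPU_PLAN = {
    "type": "all2all",
    "num_gpus": 1,
    "main_gpu": 0,
    "num_steps": 1,
    "num_chunks": 1,
    "plan": [[0, 0]],
    "chunks": [1],
}


def write_plan_file(path):
    """Write the single-GPU plan to path as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(SINGLE_GPU_PLAN, handle)


plan_text = json.dumps(SINGLE_GPU_PLAN)
```

Calling `write_plan_file` with the plan file name from your configuration produces the file that the automatic generator cannot create on a one-GPU system.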

    • If using a system that contains two GPUs with two NVLink connections, the auto plan file generator will print the following warning message: RuntimeWarning: divide by zero encountered in true_divide. This is an erroneous warning message and should be ignored.

    • The current plan file generator doesn't support a system where the NVSwitch or a full peer-to-peer connection between all nodes is unavailable.

    • Users need to set the CUDA_DEVICE_ORDER environment variable (export CUDA_DEVICE_ORDER=PCI_BUS_ID) to ensure that the CUDA runtime and driver have a consistent GPU numbering.

    • LocalizedSlotSparseEmbeddingOneHot only supports a single-node machine where all the GPUs are fully connected such as NVSwitch.

    • HugeCTR version 3.0 crashes when running the DLRM sample on DGX2 due to a CUDA Graph issue. To run the sample on DGX2, disable the CUDA Graph by setting the cuda_graph parameter to false even if it degrades the performance a bit. This issue doesn't exist when using the DGX A100.

    • The HugeCTR embedding TensorFlow plugin only works with single-node machines.

    • The HugeCTR embedding TensorFlow plugin assumes that the input keys are in int64 and its output is in float.

    • If the number of samples in a dataset is not divisible by the batch size when in epoch mode and using num_epochs instead of max_iter, a few remaining samples are truncated. If the training dataset is large enough, the impact is negligible. If you want to minimize the wasted batches, try adjusting the number of data reader workers. For example, when using a file list source, set the num_workers parameter to a divisor of the number of data files in the file list.
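
The truncation described above is integer division with a remainder. The sketch below shows the arithmetic for a single stream of samples. It ignores how samples are spread across data reader workers, and `epoch_batches` is a hypothetical helper.

```python
def epoch_batches(num_samples, batch_size):
    """Return (full_batches, truncated_samples) for one epoch."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    full_batches, truncated_samples = divmod(num_samples, batch_size)
    return full_batches, truncated_samples


# 1000 samples with a batch size of 64: 15 full batches, 40 samples left over.
full, dropped = epoch_batches(num_samples=1000, batch_size=64)
```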

  • v2.3(Nov 24, 2020)

    Release Notes

    What's New in Version 2.3

    We’ve implemented the following enhancements to improve usability and performance:

    • Python Interface: To enhance the interoperability with NVTabular and other Python-based libraries, we're introducing a new Python interface for HugeCTR. If you are already using HugeCTR with JSON, the transition to Python will be seamless for you as you'll only have to locate the hugectr.so file and set the PYTHONPATH environment variable. You can still configure your model in your JSON config file, but the training options such as batch_size must be specified through hugectr.solver_parser_helper() in Python. For additional information regarding how to use the HugeCTR Python API and comprehend its API signature, see our Jupyter Notebook tutorial.

    • HugeCTR Embedding with TensorFlow: To help users easily integrate HugeCTR’s optimized embedding into their TensorFlow workflow, we now offer the HugeCTR embedding layer as a TensorFlow plugin. To better understand how to install, use, and verify it, see our Jupyter notebook tutorial. It also demonstrates how you can create a new Keras layer EmbeddingLayer based on the hugectr_tf_ops.py helper code that we provide.

    • Model Oversubscription: To enable a model with large embedding tables that exceed a single GPU's memory limit, we added a new model prefetching feature, giving you the ability to load a subset of an embedding table into the GPU in a coarse-grained, on-demand manner during the training stage. To use this feature, you need to split your dataset into multiple sub-datasets while extracting the unique key sets from them. This feature can currently only be used with the Norm dataset format and its corresponding file list. This feature will eventually support all embedding types and dataset formats. We revised our criteo2hugectr tool to support the key set extraction for the Criteo dataset. For additional information, see our Python Jupyter Notebook to learn how to use this feature with the Criteo dataset. Please note that the Criteo dataset is a common use case, but model prefetching is not limited to this dataset.
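
The preparation step above (split the dataset into sub-datasets and extract the unique keys of each) can be sketched as follows. The sketch assumes each sample is simply a list of integer keys. It does not read or write the Norm format, and `split_with_keysets` is a hypothetical helper rather than part of criteo2hugectr.

```python
def split_with_keysets(samples, num_parts):
    """Split samples into at most num_parts contiguous sub-datasets.

    Returns (parts, keysets), where keysets[i] is the sorted list of
    unique keys that appear in parts[i].
    """
    if num_parts < 1:
        raise ValueError("num_parts must be positive")
    if not samples:
        return [], []
    part_size = -(-len(samples) // num_parts)  # ceiling division
    parts = [
        samples[start:start + part_size]
        for start in range(0, len(samples), part_size)
    ]
    keysets = [
        sorted({key for sample in part for key in sample})
        for part in parts
    ]
    return parts, keysets


samples = [[1, 2], [2, 3], [4], [1, 4]]
parts, keysets = split_with_keysets(samples, num_parts=2)
```

Each key set names the embedding rows that must be resident on the GPU while the matching sub-dataset is being trained.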

    • Enhanced AUC Implementation: To enhance the performance of our AUC computation on multi-node environments, we redesigned our AUC implementation to improve how the computational load gets distributed across nodes.

    • Epoch-Based Training: In addition to max_iter, a HugeCTR user can set num_epochs in the Solver clause of their JSON config file. This mode can only currently be used with Norm dataset formats and their corresponding file lists. All dataset formats will be supported in the future.

    • Multi-Node Training Tutorial: To better support multi-node training use cases, we added a new step-by-step tutorial.

    • Power Law Distribution Support with Data Generator: Because of the increased need for generating a random dataset whose categorical features follow the power-law distribution, we revised our data generation tool to support this use case. For additional information, refer to the --long-tail description here.

    • Multi-GPU Preprocessing Script for Criteo Samples: Multiple GPUs can now be used when preparing the dataset for our samples. For additional information, see how preprocess_nvt.py is used to preprocess the Criteo dataset for DCN, DeepFM, and W&D samples.
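
To make the power-law item above concrete, here is a minimal sketch of drawing categorical keys with a long-tail frequency distribution. It is not the HugeCTR data generator and does not reproduce its --long-tail option; the function name power_law_keys and the parameters vocab_size, num_samples, and alpha are assumptions introduced only for illustration.

```python
import random
from collections import Counter


def power_law_keys(vocab_size, num_samples, alpha=1.2, seed=0):
    """Draw categorical keys whose frequencies follow a power law.

    Key k (0-based) is drawn with probability proportional to
    1 / (k + 1) ** alpha, so a few low-numbered keys are "hot" and the
    rest form a long tail. Hypothetical helper, not a HugeCTR API.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** alpha) for rank in range(1, vocab_size + 1)]
    return rng.choices(range(vocab_size), weights=weights, k=num_samples)


keys = power_law_keys(vocab_size=1000, num_samples=20000)
counts = Counter(keys)

# Fraction of all samples that land on the 10 most probable keys.
top10_share = sum(counts[k] for k in range(10)) / len(keys)
print(f"share of samples hitting the 10 hottest keys: {top10_share:.2f}")
```

A larger alpha concentrates more of the samples on the hottest keys, which is the property that makes such synthetic data useful for stressing embedding lookups.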

    Known Issues

    • Because the automatic plan file generator cannot handle systems that contain only one GPU, a user must manually create a JSON plan file with the following parameters and give it the name listed in the HugeCTR configuration file: {"type": "all2all", "num_gpus": 1, "main_gpu": 0, "num_steps": 1, "num_chunks": 1, "plan": [[0, 0]], "chunks": [1]}.
    • If using a system that contains two GPUs with two NVLink connections, the auto plan file generator will print the following warning message: RuntimeWarning: divide by zero encountered in true_divide. This is an erroneous warning message and should be ignored.
    • The current plan file generator doesn't support a system where the NVSwitch or a full peer-to-peer connection between all nodes is unavailable.
    • Users need to set the CUDA_DEVICE_ORDER environment variable (export CUDA_DEVICE_ORDER=PCI_BUS_ID) to ensure that the CUDA runtime and driver have a consistent GPU numbering.
    • LocalizedSlotSparseEmbeddingOneHot only supports a single-node machine where all the GPUs are fully connected, such as through NVSwitch.
    • HugeCTR version 2.2.1 crashes when running our DLRM sample on DGX2 due to a CUDA Graph issue. To run the sample on DGX2, disable the use of CUDA Graph with "cuda_graph": false even if it degrades the performance a bit. We are working on fixing this issue. This issue doesn't exist when using the DGX A100.
    • The model prefetching feature is only available in Python. Currently, a user can only use this feature with the DistributedSlotSparseEmbeddingHash embedding and the Norm dataset format on single GPUs. This feature will eventually support all embedding types and dataset formats.
    • The HugeCTR embedding TensorFlow plugin only works with single-node machines.
    • The HugeCTR embedding TensorFlow plugin assumes that the input keys are in int64 and its output is in float.
    • When using our embedding plugin, please note that the fprop_v3 function, which is available in tools/embedding_plugin/python/hugectr_tf_ops.py, only works with DistributedSlotSparseEmbeddingHash.
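
As a minimal sketch of the single-GPU workaround listed above, the following script writes the plan parameters from the first known issue to a JSON file and sets CUDA_DEVICE_ORDER for the current process. The file name all2all_plan.json is a placeholder assumption; use whatever name your HugeCTR configuration file refers to.

```python
import json
import os

# Plan parameters copied from the known-issues entry for single-GPU systems.
single_gpu_plan = {
    "type": "all2all",
    "num_gpus": 1,
    "main_gpu": 0,
    "num_steps": 1,
    "num_chunks": 1,
    "plan": [[0, 0]],
    "chunks": [1],
}

# Placeholder name (assumption): it must match the plan file name
# listed in your HugeCTR configuration file.
plan_path = "all2all_plan.json"

with open(plan_path, "w") as f:
    json.dump(single_gpu_plan, f)

# Equivalent of `export CUDA_DEVICE_ORDER=PCI_BUS_ID` for this process and
# its children; it needs to be set before any CUDA library is initialized.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
```

Setting the variable in the shell before launching training has the same effect and is the form the known issue describes.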
  • v2.2.1(Sep 18, 2020)

    What’s New in Version 2.2.1

    In HugeCTR version 2.2.1, we added user-convenience features alongside refactoring and bug fixes.

    • Dataset in Parquet Format Support : The HugeCTR data reader was extended to support the Parquet format. The preprocessed dataset and its metadata can be generated with NVTabular.

    • GPU-Powered Preprocessing Script : The preprocessing script used for HugeCTR samples such as DCN, DeepFM, and W&D was rewritten in NVTabular. This GPU-accelerated script no longer consumes excessive host memory.

    • Preprocessing Tool for DLRM Sample : To make it easier for the user to run our DLRM sample, a preprocessing tool written in CUDA C++ was added.

    • Use of RAPIDS MLPrims : Some existing layers were rewritten to utilize the highly optimized machine learning primitives (MLPrims) of RAPIDS.

    • Reorganization of Submodules : All the submodules were moved to the third_party directory.

    • Revived Pascal Support : Support for the Pascal architecture (e.g., P100) was added back. However, with a Pascal graphics card, InteractionLayer doesn’t support FP16.

    • Compile Time Reduction : By modularizing the embedding-related code into several files, HugeCTR compile time was reduced.

    • Refactoring of Tensor and GeneralBuffer : In HugeCTR, Tensor and GeneralBuffer are used for memory management and access control. In version 2.2.1, they were refactored to clarify their responsibilities and to support different memory kinds, e.g., Host, Device, and Unified. Check their interface changes if you are using them to add a new layer.

  • v2.2(Jul 26, 2020)

    New Features in Version 2.2

    HugeCTR version 2.2 adds many features to enhance its usability and performance. HugeCTR is not only a high-performance reference design for framework designers but also a self-contained training framework.

    • Algorithm Search : HugeCTR runs an exhaustive algorithm search for each fully connected layer to find the best-performing one.
    • AUC : A user can choose to use AUC as an evaluation metric in addition to AverageLoss. It is also possible to stop training when AUC reaches a specified threshold.
    • Batch Shuffle in Training Data Set : Training data batch shuffling is supported.
    • Different Batch Sizes for Training and Evaluation : A user can specify different batch sizes for training and evaluation. This can be useful for tuning overall performance.
    • Full FP16 pipeline : To improve data and compute throughput, we added the full FP16 pipeline.
    • Fused Fully Connected Layer : In FP16 mode, you can choose to use a specialized fully connected layer fused with the ReLU activation function.
    • Evaluation Data Caching on Device : For GPUs with large memory capacity like A100, a user can choose to cache data batches for small evaluation data sets.
    • Interaction Layer : We added the Interaction layer used for popular models such as DLRM.
    • Optimized Data Reader for Raw Data Format : RAW data format is supported to simplify the one hot data reading and achieve better performance.
    • Deep Learning Recommendation Model (DLRM) : We enabled and optimized the training of DLRM. Please find more details in samples/dlrm.
    • Learning Rate Scheduling : Different learning rate scheduling is supported.
    • Weight Initialization Methods : For each trainable layer, a user can choose which method (e.g., XavierUniform, Zero, etc.) is used for its weight initialization.
    • Ampere Support : We tested and optimized HugeCTR for Ampere Architecture.
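
The AUC item above mentions stopping training once AUC reaches a specified threshold. The following is a minimal, CPU-only sketch of that idea using the rank-sum formulation of AUC. It is not HugeCTR's GPU implementation, and the names auc, should_stop, and auc_threshold are assumptions made for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula.

    labels: list of 0/1 ints; scores: list of floats, same length.
    Tied scores receive the average of their ranks.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - num_pos * (num_pos + 1) / 2.0) / (num_pos * num_neg)


def should_stop(labels, scores, auc_threshold):
    """Return True once the evaluation AUC reaches the threshold."""
    return auc(labels, scores) >= auc_threshold


eval_labels = [0, 0, 1, 1]
eval_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(eval_labels, eval_scores))
print(should_stop(eval_labels, eval_scores, auc_threshold=0.8))
```

In a training loop, a check like should_stop would run after each evaluation pass, so the threshold acts as an early-exit condition alongside the iteration limit.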
  • v2.2-beta(Jun 26, 2020)

    HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training. The version 2.2 beta release introduces several important feature updates:

    • Algorithm Search : Support algorithm selection in fully connected layers for better performance.
    • AUC : Support AUC calculation for accuracy evaluation.
    • Batch shuffle and last batch in eval : Support batch shuffle; the last batch during evaluation is no longer dropped.
    • Different batch size in training and evaluation : Support this for best performance in evaluation.
    • Full mixed precision training pipeline: Support full mixed precision training [1].
    • Fused fully connected layer : Fused bias addition and ReLU activation into a single layer.
    • Caching evaluation data on device : For the GPUs with large memory like A100, we can use caching data for small evaluation data sets.
    • Interaction layer
    • Optimized data reader for raw format
    • Learning rate scheduling
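
To illustrate what fusing bias addition and ReLU into the fully connected layer means, here is a minimal pure-Python sketch that computes relu(x · W + b) in one pass, without materializing the intermediate pre-activation tensor. It only shows the idea; HugeCTR's fused layer is a GPU kernel, and the function name fused_fc_bias_relu is an assumption.

```python
def fused_fc_bias_relu(x, weight, bias):
    """Compute relu(x @ weight + bias) in a single pass.

    x: batch of rows (batch x in_dim), weight: in_dim x out_dim,
    bias: out_dim. Bias addition and ReLU are applied as each output
    element is produced, rather than as separate layers.
    """
    out = []
    for row in x:
        out_row = []
        for j in range(len(bias)):
            acc = bias[j]  # start the accumulator from the bias
            for i, xi in enumerate(row):
                acc += xi * weight[i][j]
            out_row.append(acc if acc > 0.0 else 0.0)  # ReLU in place
        out.append(out_row)
    return out


x = [[1.0, -2.0], [0.5, 3.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, to keep the result easy to check
bias = [0.5, 0.5]
y = fused_fc_bias_relu(x, weight, bias)
print(y)
```

The benefit of fusion on a GPU comes from avoiding extra kernel launches and extra reads and writes of the intermediate result; the arithmetic itself is unchanged.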

    [1] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, H. Wu. Mixed Precision Training. https://arxiv.org/abs/1710.03740

  • v2.1_a100_update(Jun 17, 2020)

    HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training. The newly released NVIDIA A100 GPU provides acceleration at various scales for AI, data analytics, and high-performance computing (HPC), and is built to meet demanding computing challenges. To demonstrate HugeCTR’s performance on the A100 GPU, this version was developed to leverage the new features of this GPU.

  • v2.1_a100(Jun 15, 2020)

  • v2.1(May 14, 2020)

    New Features:

    • Supporting three important networks: Wide and Deep Learning (WDL), Deep Cross Network (DCN) and DeepFM.
    • A new embedding implementation, LocalizedSlotSparseEmbedding, which reduces the memory transactions across GPUs and nodes regardless of the number of GPUs.
    • Supporting multiple Embeddings in one network.
    • Supporting dense feature input.
    • Supporting new layers like: Dropout / Split / Reshape / Multiply / FmOrder2 / MultCross / Add
    • Check bits in data reader to enable data check and error skip.
Owner
Merlin is a framework providing end-to-end GPU-accelerated recommender systems, from feature engineering to deep learning training and deploying to production