Golang bindings for NVIDIA Data Center GPU Manager (DCGM)

Overview

Bindings

Golang bindings are provided for NVIDIA Data Center GPU Manager (DCGM). DCGM is a set of tools for managing and monitoring NVIDIA GPUs in cluster environments. It is a low-overhead tool suite that performs a variety of functions on each host system, including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration, and accounting.

You will also find samples for these bindings in this repository.

Issues and Contributing

Check out the Contributing document!

Issues
  • dcgm.GetProcessInfo() function returns "No data is available"

    I am trying to build a GPU stats collector into some of our tooling, and I am using the "dcgm.GetProcessInfo()" function to get data about various processes. I am trying to use DCGM in embedded mode so that I do not have to install the full DCGM package on every host running in our system. However, I am having trouble getting it to work reliably, as I keep getting "No data is available" errors, similar to https://github.com/NVIDIA/gpu-monitoring-tools/issues/32. I modeled my implementation on the RestAPI example, specifically this: https://github.com/NVIDIA/go-dcgm/blob/main/samples/restApi/handlers/dcgm.go#L114

    I am testing by doing the following:

    1. Run nbody as myself in benchmark mode.
    2. Run my stats collector code as root, passing in the PID of nbody

    In the stats collector, this function is called in a loop, with a sleep of 1-10s between each call:

    func (g *gpuStatManagerImpl) GetProcessInfo(pid uint) ([]stats.GPUStat, error) {
    	if g.dcgmWatchGroup == nil {
    		if err := g.watchPidFields(); err != nil {
    			return nil, err
    		}
    		// If this is the first time this function has been called, sleep for 1s to allow stats collection to init
    		time.Sleep(1 * time.Second)
    	}
    
    	pInfo, err := dcgm.GetProcessInfo(*g.dcgmWatchGroup, pid)
    	if err != nil {
    		return nil, err
    	}
    	gpuStats := make([]stats.GPUStat, len(pInfo))
    	for i, p := range pInfo {
    		gpuStats[i] = stats.GPUStat{
    			DeviceID:          p.GPU,
    			SMUtilization:     p.ProcessUtilization.SmUtil,
    			MemoryUtilization: p.ProcessUtilization.MemUtil,
    			PowerUsageJoules:  p.ProcessUtilization.EnergyConsumed,
    		}
    	}
    
    	if err := g.watchPidFields(); err != nil {
    		return gpuStats, err
    	}
    
    	return gpuStats, nil
    }
    
    func (g *gpuStatManagerImpl) watchPidFields() error {
    	group, err := dcgm.WatchPidFields()
    	if err != nil {
    		return err
    	}
    	g.dcgmWatchGroup = &group
    	return nil
    }
    

    I have tried using the go-dcgm RestAPI example, and the processInfo example, and both get the same "No data is available" error. Here is the nvidia-smi output of the GPU. As you can see, the nbody process is running:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
    | N/A   72C    P0    69W /  70W |   1484MiB / 15109MiB |    100%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A     14335      C   ./nbody                          1481MiB |
    +-----------------------------------------------------------------------------+
    
    opened by SQUIDwarrior 6
  • DCGM exporter is not working on AWS A10 instances (G5.2xlarge, G5.12xlarge)

    We recently added the AWS G5 (A10) tier to our Kubernetes clusters. The applications are running fine on the new tier. However, the DCGM exporter pods are having a tough time with scheduling and are going into CrashLoopBackOff.

    DCGM Version: 2.3.2-2.6.3-ubuntu20.04
    NVIDIA driver version: 460.106.00
    OS/Flatcar version: 3139.2.0
    Kernel: 5.15.32-flatcar

    Logs:

    [email protected]:/infrastructure# k logs dcgm-exporter-w9fhs -n monitoring dcgm-exporter
    time="2022-05-27T08:53:58Z" level=info msg="Starting dcgm-exporter"
    time="2022-05-27T08:53:58Z" level=info msg="DCGM successfully initialized!"
    time="2022-05-27T08:53:59Z" level=info msg="Collecting DCP Metrics"
    time="2022-05-27T08:53:59Z" level=fatal msg="Error watching fields: Feature not supported"

    opened by vickeyrihal1 3
  • monitor job statistic

    Hello, I want to use go-dcgm to monitor job statistics, which may require the "dcgmJobStartStats" and "dcgmJobStopStats" APIs. Do you plan to implement them? I hope to hear back from you.

    opened by laolaoMonkey 2
  • Segfault using result of FieldValue_v1.String()

    While trying to implement https://github.com/NVIDIA/dcgm-exporter/issues/72 I've encountered a segfault when trying to use DCGM_FI_DRIVER_VERSION; specifically if I try to format the result of FieldValue_v1.String().

    I've only written a few hundred lines of Go in my life, but this looks suspicious: it's directly casting a C NUL-terminated string to a Go string, rather than using C.GoString.

    opened by bmerry 2
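The termination rule behind the report above can be sketched in pure Go. This is not the library's code and it does not use cgo: `cString` and the 16-byte buffer are hypothetical, chosen only to show why a fixed-size, NUL-terminated buffer should be cut at the first NUL byte (which is what cgo's `C.GoString` does for a C string) instead of being converted wholesale.

```go
package main

import (
	"bytes"
	"fmt"
)

// cString returns the text stored in buf up to, but not including, the first
// NUL byte. If buf holds no NUL byte, the whole buffer is returned.
func cString(buf []byte) string {
	if i := bytes.IndexByte(buf, 0); i >= 0 {
		return string(buf[:i])
	}
	return string(buf)
}

func main() {
	// Hypothetical 16-byte value buffer holding a NUL-terminated string.
	var raw [16]byte
	copy(raw[:], "460.91.03")

	naive := string(raw[:])  // converts all 16 bytes, trailing NULs included
	fixed := cString(raw[:]) // stops at the terminator

	fmt.Printf("naive: %q (len %d)\n", naive, len(naive))
	fmt.Printf("fixed: %q (len %d)\n", fixed, len(fixed))
}
```

The sketch only covers where the string ends. Whether the segfault itself comes from the cast, from the buffer's lifetime, or from something else is not established by the report.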
  • Add new WatchPidFieldsEx method

    This modifies the private watchPidFields() method to pass through the timing/sample limit parameters to the underlying C function rather than simply use the defaults. I then added a new WatchPidFieldsEx() method to allow using this feature in code leveraging this library. The existing WatchPidFields() method maintains the original behavior.

    I added this because I wanted the ability to get process-level stats at a higher frequency than the default 30s rate. This allows that to be adjusted by users of this library for different use-cases, while maintaining the default behavior.

    opened by SQUIDwarrior 2
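The shape of the change described in the pull request above, an extended entry point that passes every setting through plus an unchanged default entry point that delegates to it, can be sketched without DCGM. Everything here is hypothetical: `watchParams`, `startWatch`, `watchEx`, and `watch` are stand-ins, and only the 30s rate comes from the pull request text; the other defaults are placeholders, and the real method's signature is not shown in the description.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// watchParams groups the timing and sample-limit settings. The field names
// and every default except the 30s rate are hypothetical.
type watchParams struct {
	UpdateFreq     time.Duration
	MaxKeepAge     time.Duration
	MaxKeepSamples int
}

var defaults = watchParams{
	UpdateFreq:     30 * time.Second, // rate quoted in the pull request
	MaxKeepAge:     time.Hour,        // placeholder value
	MaxKeepSamples: 0,                // placeholder: no sample-count limit
}

// startWatch stands in for the underlying C function and simply echoes the
// settings it was asked to apply.
func startWatch(p watchParams) (watchParams, error) {
	if p.UpdateFreq <= 0 {
		return watchParams{}, errors.New("update frequency must be positive")
	}
	return p, nil
}

// watchEx passes caller-supplied settings straight through.
func watchEx(updateFreq, maxKeepAge time.Duration, maxKeepSamples int) (watchParams, error) {
	return startWatch(watchParams{
		UpdateFreq:     updateFreq,
		MaxKeepAge:     maxKeepAge,
		MaxKeepSamples: maxKeepSamples,
	})
}

// watch keeps the original behavior by delegating with the defaults.
func watch() (watchParams, error) {
	return watchEx(defaults.UpdateFreq, defaults.MaxKeepAge, defaults.MaxKeepSamples)
}

func main() {
	d, _ := watch()
	f, _ := watchEx(time.Second, 10*time.Minute, 600)
	fmt.Println("default update frequency:", d.UpdateFreq)
	fmt.Println("custom update frequency: ", f.UpdateFreq)
}
```

Keeping the default function as a thin wrapper means existing callers see no change, while new callers can request a faster sampling rate.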
  • Updates to sample READMEs/comments regarding issue #3

    This is to address #3 as it tries to make it clear why the processInfo and restApi samples will not "just work" when trying to get process-level stats. This also adds or updates some gitignore files to ignore the binaries created in the sample directories, and to ignore the ".idea" directory at the root for devs using IntelliJ IDEs such as GoLand.

    opened by SQUIDwarrior 1
  • Change go-dcgm to be friendlier to users who prefer doing vendoring.

    This change moves dcgm headers into the pkg/dcgm directory. Go vendoring does not copy non-go files from subdirectories/sibling directories if there are no go files there. See https://github.com/golang/go/issues/26366 for reference.

    Fixes https://github.com/NVIDIA/dcgm-exporter/issues/21

    Signed-off-by: Nik Konyuchenko [email protected]

    opened by nikkon-dev 0
  • Updates to sample READMEs/comments regarding issue #3

    This is to address #3 as it tries to make it clear why the processInfo and restApi samples will not "just work" when trying to get process-level stats. This also adds or updates some gitignore files to ignore the binaries created in the sample directories, and to ignore the ".idea" directory at the root for devs using IntelliJ IDEs such as GoLand.

    opened by SQUIDwarrior 0
  • Adding new WatchPidFieldsEx method

    This modifies the private watchPidFields() method to pass through the timing/sample limit parameters to the underlying C function rather than simply use the defaults. I then added a new WatchPidFieldsEx() method to allow using this feature in code leveraging this library. The existing WatchPidFields() method maintains the original behavior.

    I added this because I wanted the ability to get process-level stats at a higher frequency than the default 30s rate. This allows that to be adjusted by users of this library for different use-cases, while maintaining the default behavior.

    opened by SQUIDwarrior 0
  • The processInfo example output is incorrect.

    The fields Avg SM Utilization and Avg Memory Utilization are switched in the template string:

    Avg SM Utilization (%)       : {{or .GpuUtilization.Memory "N/A"}}
    Avg Memory Utilization (%)   : {{or .GpuUtilization.GPU "N/A"}}
    

    https://github.com/NVIDIA/go-dcgm/blob/main/samples/processInfo/main.go#L25-L26

    opened by SQUIDwarrior 0
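For the swapped template fields reported above, a minimal, self-contained sketch of the corrected pairing might look like this. The `utilization` and `processInfo` types are hypothetical stand-ins and not the library's definitions; pointer fields are assumed so that a missing value falls through to "N/A" in the `or` expression.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// utilization and processInfo are hypothetical stand-ins for the data the
// sample prints. A nil pointer counts as "empty" for the template's or.
type utilization struct {
	GPU    *uint
	Memory *uint
}

type processInfo struct {
	GpuUtilization utilization
}

// report pairs each label with the matching field: SM with .GPU and
// Memory with .Memory.
const report = `Avg SM Utilization (%)       : {{or .GpuUtilization.GPU "N/A"}}
Avg Memory Utilization (%)   : {{or .GpuUtilization.Memory "N/A"}}
`

// render executes the report template against p and returns the text.
func render(p processInfo) (string, error) {
	t, err := template.New("report").Parse(report)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, p); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	sm := uint(97) // made-up sample value
	out, err := render(processInfo{GpuUtilization: utilization{GPU: &sm}})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

With only the SM value set, the first line shows the number and the second falls back to "N/A", which makes a swapped pair easy to spot.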
  • restAPI example throws template error when hitting the /device/info endpoint

    The restAPI example code is throwing this error when you attempt to hit the /device/info endpoint:

    $> curl localhost:8070/dcgm/device/info/id/0
    Driver Version         : 460.91.03
    GPU                    : 0
    DCGMSupported          : Yes
    UUID                   : GPU-957108de-cfa3-9448-9bf9-62961b74aff3
    Brand                  : Unknown
    Model                  : Tesla T4
    Serial Number          : 1560121400704
    Vbios                  : 90.04.96.00.02
    InforomImage Version   : G183.0200.00.02
    Bus ID                 : 00000000:00:1E.0
    BAR1 (MB)              : 256
    FrameBuffer Memory (MB): 15109
    Bandwidth (MB/s)       : 15760
    Cores (MHz)            : template: :14:37: executing "" at <.Clocks.Cores>: can't evaluate field Clocks in type *dcgm.Device
    

    The issue looks to be here: https://github.com/NVIDIA/go-dcgm/blob/d43cd90d89adc11cfec57de0fa0d0de193976058/samples/restApi/handlers/utils.go#L31-L32 where the template string references .Clocks.Cores and .Clocks.Memory, which are part of the DeviceStatus struct, not the Device struct.

    opened by SQUIDwarrior 0
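The template error in the report above can be reproduced in isolation. In this sketch `Device`, `Clocks`, `DeviceStatus`, and `deviceView` are hypothetical, trimmed-down stand-ins for the real structs, and bundling both values into one view is only one possible way to resolve the mismatch, not necessarily the fix the maintainers chose.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Hypothetical, trimmed-down stand-ins for the structs named in the report.
type Device struct{ GPU uint }

type Clocks struct{ Cores, Memory uint }

type DeviceStatus struct{ Clocks Clocks }

// deviceView bundles both values so one template can reach every field.
type deviceView struct {
	Device *Device
	Status *DeviceStatus
}

const (
	broken = `GPU: {{.GPU}}, Cores (MHz): {{.Clocks.Cores}}`
	fixed  = `GPU: {{.Device.GPU}}, Cores (MHz): {{.Status.Clocks.Cores}}`
)

// render executes text against data and returns whatever was written,
// together with any execution error.
func render(text string, data interface{}) (string, error) {
	t, err := template.New("info").Parse(text)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	err = t.Execute(&sb, data)
	return sb.String(), err
}

func main() {
	dev := &Device{GPU: 0}
	status := &DeviceStatus{Clocks: Clocks{Cores: 1590, Memory: 5001}} // made-up values

	// Fails: *Device has no Clocks field.
	if _, err := render(broken, dev); err != nil {
		fmt.Println("broken:", err)
	}

	// Works: the view exposes both the device and its status.
	out, err := render(fixed, deviceView{Device: dev, Status: status})
	fmt.Println("fixed: ", out, err)
}
```

Because text/template writes output as it goes, a field error in the middle of a template leaves the earlier lines already written, which matches the partial response shown in the report.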
Owner
NVIDIA Corporation