NVIDIA-contributed CUDA tutorial for Numba

Overview

Numba for CUDA Programmers

Author: Graham Markall, NVIDIA ([email protected]).

What is this course?

This is an adapted version of a course delivered internally at NVIDIA. Its primary audience is those who are familiar with CUDA C/C++ programming but perhaps less so with Python and its ecosystem. That said, it should also be useful to those already familiar with Python and the PyData ecosystem.

It focuses on using CUDA concepts in Python rather than teaching basic CUDA concepts. Those unfamiliar with CUDA may want to build a base understanding by working through Mark Harris's An Even Easier Introduction to CUDA blog post and briefly reading Chapters 1 and 2 of the CUDA Programming Guide (Introduction and Programming Model). Other concepts used in the course (such as shared memory) are covered in later chapters of the Programming Guide. For expediency, it is recommended to look up those concepts when necessary rather than reading all the reference material in detail.

What is in this course?

The course is broken into 5 sessions. Each session is designed to be presented, with its examples and exercises worked through, before participants move on to the next session. The course could be delivered at a cadence of one session per week, with an hour of presentation time, to fit it around other tasks. Alternatively, it could be delivered as a tutorial over the course of 2-3 days.

Session 1: An introduction to Numba and CUDA Python

Session 1 files are in the session-1 folder. Contents:

  • Presentation: The presentation for this session, along with notes.
  • Mandelbrot example: See the README for exercises.
  • CUDA Kernels notebook: In the exercises folder. Open the notebook using Jupyter.
  • UFuncs notebooks: In the exercises folder. Open the notebooks using Jupyter. Contains two notebooks on vectorize and guvectorize on the CPU (as it's a little easier to experiment with them on the CPU target) and one notebook on CUDA ufuncs and memory management.

Session 2: Typing

Session 2 files are in the session-2 folder. Contents:

  • Presentation: The presentation for this session, along with notes.
  • Exercises: In the exercises folder. Open the notebook using Jupyter.

Session 3: Porting strategies, performance, interoperability, debugging

Session 3 files are in the session-3 folder. Contents:

  • Presentation: The presentation for this session, along with notes.
  • Exercises: In the exercises folder. Open the notebook using Jupyter.
  • Examples: In the examples folder. These are mostly executable versions of the examples given in the slides.

Session 4: Extending Numba

Session 4 files are in the session-4 folder. Contents:

  • Presentation: The presentation for this session, along with notes.
  • Exercises: In the exercises folder. Open the notebook using Jupyter. A solution to the exercise is also provided.
  • Examples: In the examples folder. This contains a notebook working through the Interval example presented in the slides.

Session 5: Memory Management

Session 5 files are in the session-5 folder. Contents:

  • Presentation: The presentation for this session, along with notes.
  • Exercises: In the exercises folder. Open the notebook using Jupyter.
  • Examples: In the examples folder. This contains examples of a simple EMM Plugin wrapping cudaMalloc, and an EMM Plugin for using the CuPy pool allocator with Numba.

Sources

Some of the material in this course is derived from various sources. These sources are:

References

The following references can be useful for studying CUDA programming in general, and the intermediate languages used in the implementation of Numba:

Issues
  • Session 4 (Extending Numba) should mention the `__hash__` and `__eq__` methods of types

    Failing to implement these results in odd behaviour for parameterised types:

    • `__hash__` is required for correct interning.
    • `__eq__` is required to determine whether casts are required.

    Opened by gmarkall.
  • Replace use of `definition` and `definitions` with `overloads`

    Some parts of the course retrieve kernel definitions through the `definition` and `definitions` properties. These properties are deprecated and should be replaced with the use of `overloads` instead.

    Opened by gmarkall.
  • Add section on preventing widening of integer indices

    The section on the widening of integer indices produced by a loop over a range appears to be accidentally missing; it should sit just before the "Limiting register usage" section in Session 2.

    This section should explain the use of a while loop with explicit incrementing to generate indices that remain int32, rather than being widened to 64 bits as happens with range.

    Opened by gmarkall.
  • Add section on Grid Groups and Grid sync

    Grid groups and grid sync were added in Numba 0.53.1. A short section on using these to implement a global barrier would be good, perhaps based around the example kernel from the documentation:

    from numba import cuda, int32, void

    @cuda.jit(void(int32[:,::1]))
    def sequential_rows(M):
        col = cuda.grid(1)
        g = cuda.cg.this_grid()

        rows = M.shape[0]
        cols = M.shape[1]

        for row in range(1, rows):
            opposite = cols - col - 1
            # Each row's elements are one greater than the previous row
            M[row, col] = M[row - 1, opposite] + 1
            # Wait until all threads have written their column element,
            # and the write is visible to all other threads
            g.sync()

    Opened by gmarkall.