An efficient C++17 GPU numerical computing library with Python-like syntax

Overview

MatX - Matrix Primitives Library


MatX is a modern C++ library for numerical computing on NVIDIA GPUs. Near-native performance can be achieved while using a simple syntax common in higher-level languages such as Python or MATLAB.

[Image: side-by-side FFT resampler implementations in NumPy and MatX]

The above image shows the Python (NumPy) version of an FFT resampler next to the MatX version. The total runtimes of the NumPy, CuPy, and MatX versions are shown below:

  • Python/NumPy: 4500ms (Xeon(R) CPU E5-2698 v4 @ 2.20GHz)
  • CuPy: 10.6ms (A100)
  • MatX: 2.54ms (A100)

While the code complexity and length are roughly the same, the MatX version shows a 1771x speedup over the NumPy version and is over 4x faster than the CuPy version on the same GPU.
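
To give a flavor of the syntax, here is a minimal sketch of an element-wise MatX expression. It is illustrative only: the shapes and values are arbitrary, and the calls (make_tensor, SetVals, run, Print) mirror those used in the examples and issues later on this page.

#include <matx.h>

int main() {
  // Create three small 2D tensors; the rank is a template parameter
  auto a = matx::make_tensor<float, 2>({2, 3});
  auto b = matx::make_tensor<float, 2>({2, 3});
  auto c = matx::make_tensor<float, 2>({2, 3});

  a.SetVals({{1, 2, 3}, {4, 5, 6}});
  b.SetVals({{6, 5, 4}, {3, 2, 1}});

  // Expressions are lazily evaluated; run() launches a single fused kernel
  (c = a * a + b).run();

  cudaDeviceSynchronize();
  c.Print(0, 0);
  return 0;
}

Aside from the type and shape declarations, the expression reads much like the equivalent NumPy code.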

Key features include:

  • 🚀 MatX is fast. By using existing, optimized libraries as a backend and generating efficient kernels when needed, no hand optimizations are necessary

  • 👐 MatX is easy to learn. Users familiar with high-level languages will pick up the syntax quickly

  • 📑 MatX easily integrates with existing libraries and code

  • 🎇 Visualize data from the GPU right on a web browser

  • ↕️ IO capabilities for reading/writing files


Requirements

MatX uses bleeding-edge features in the CUDA compilers and libraries. For this reason, CUDA 11.2 and g++ 9 or newer are required. You can download the CUDA Toolkit here.

MatX has been tested on and supports Pascal, Turing, Volta, and Ampere GPU architectures. We currently do not support the Jetson embedded GPUs, as JetPack currently ships with CUDA 10.2.

Installation

MatX is a header-only library that does not require compilation to use in your applications. However, unit tests, benchmarks, and examples must be compiled if enabled. CPM is used as a package manager for CMake to download and configure any dependencies. If MatX is to be used in an air-gapped environment, CPM can be configured to search locally for files. Depending on which options are enabled, compilation can take a long time without parallelism, so using the -j flag on make with the highest number your system will accommodate is suggested.

Building MatX

To build all components, issue the standard cmake build commands in a cloned repo:

mkdir build && cd build
cmake -DBUILD_TESTS=ON -DBUILD_BENCHMARKS=ON -DBUILD_EXAMPLES=ON -DBUILD_DOCS=ON ..
make -j

By default CMake will target the GPU architecture(s) of the system you're compiling on. If you wish to target other architectures, pass the CMAKE_CUDA_ARCHITECTURES flag with a list of architectures to build for:

cmake .. -DCMAKE_CUDA_ARCHITECTURES="60;70"

By default, none of the optional components is compiled. If you wish to compile certain components, use the CMake flags below with ON or OFF values:

BUILD_TESTS
BUILD_BENCHMARKS
BUILD_EXAMPLES
BUILD_DOCS

For example, to enable unit test building:

mkdir build && cd build
cmake -DBUILD_TESTS=ON ..
make -j

Note that if documentation is selected, all other build options are turned off. This reduces the dependencies needed to build the documentation, so large libraries such as CUDA don't need to be installed.

Integrating MatX With Your Own Projects

MatX uses CMake as a first-class build generator, and therefore provides the proper config files to include into your own project. There are typically two ways to do this:

  1. Adding MatX as a subdirectory
  2. Installing MatX to the system

MatX as a Subdirectory

Adding the subdirectory is useful if you include the MatX source into the directory structure of your project. Using this method, you can simply add the MatX directory:

add_subdirectory(path/to/matx)

MatX Installed to the System

The other option is to install MatX and use the configuration file provided after building. This is typically done in a way similar to what is shown below:

cd /path/to/matx
mkdir build && cd build
cmake ..
make && make install

If you have the correct permissions, the headers and CMake packages will be installed in the expected paths for your operating system. With the package installed, you can use find_package as follows:

find_package(matx CONFIG REQUIRED)

An example of using this method can be found in the examples/cmake_sample_project directory.

MatX CMake Targets

Once either of the two methods above is done, you can use the transitive target matx::matx inside target_link_libraries in your project. MatX may add other optional targets under the matx:: namespace in the future.

Documentation

Documentation for MatX can be built locally as shown above with the -DBUILD_DOCS=ON CMake flag. Building documentation requires the following to be installed: doxygen, breathe, sphinx, sphinx-rtd-theme, libjs-mathjax, texlive-font-utils, flex, bison

A hosting site for documentation is coming soon.

We are currently using semantic versioning and reserve the right to introduce breaking API changes on major releases.

Supported Data Types

MatX supports all types that use standard C++ operators for math (+, -, etc.). Unit tests are run against all common types shown below.

  • Integer: int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t
  • Floating Point: matxFp16 (fp16), matxBf16 (bfloat16), float, double
  • Complex: matxFp16Complex, matxBf16Complex, cuda::std::complex<float>, cuda::std::complex<double>

Since the CUDA half-precision types (__half and __nv_bfloat16) do not support all C++ operators on the host side, MatX provides the matxFp16 and matxBf16 types for scalars, and matxFp16Complex and matxBf16Complex for complex types. These wrappers are needed so that tensor views can be evaluated on both the host and device, regardless of CUDA or hardware support. When possible, the half types will use hardware-accelerated intrinsics automatically. Existing code using __half and __nv_bfloat16 may be converted to the matx equivalent types directly and leverage all operators.
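
As an illustration of the wrapper types, the sketch below performs host-side arithmetic with matxFp16. Treat it as a sketch: the float construction and static_cast conversion are assumptions based on the description above rather than documented API.

#include <cstdio>
#include <matx.h>

int main() {
  // matxFp16 supports standard C++ math operators on the host, unlike raw __half
  matx::matxFp16 a = 1.5f;
  matx::matxFp16 b = 2.0f;
  matx::matxFp16 c = a * b + a;  // computed with half-precision semantics

  std::printf("%f\n", static_cast<float>(c));  // prints 4.500000
  return 0;
}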

Unit Tests

MatX contains a suite of unit tests covering the primitive functions, plus end-to-end tests of example code. MatX uses pybind11 to generate some of the unit test inputs and outputs; this avoids storing large test vector files in git, since they can be generated as needed.

To run the unit tests, from the cmake build directory run:

make test

This will execute all defined unit tests. If you wish to execute a subset of tests, or run with different options, you may run test/matx_test directly with parameters defined by Google Test. To run matx_test directly, you must be inside the build/test directory for the correct paths to be set. For example, to run only tests with FFT in the name:

cd build/test
./matx_test --gtest_filter="*FFT*"

Quick Start Guide

A quick start guide can be found in the docs directory. Further, for new MatX developers, browsing the example applications can provide familiarity with the API and best practices.

Filing Issues

We welcome and encourage the creation of issues against MatX. When creating a new issue, please use the following syntax in the title of your submission to help us prioritize responses and planned work.

  • Bug Report: Prefix the issue title with [BUG], e.g. [BUG] MatX fails to build on P100 GPU
  • Documentation Request: Prefix the issue title with [DOC]
  • Feature Request: Prefix the issue title with [FEA]
  • Submit a Question: Prefix the issue title with [QST]

As with all issues, please be as verbose as possible and, if relevant, include a test script that demonstrates the bug or expected behavior. It's also helpful if you provide environment details about your system (bare-metal, cloud GPU, etc).

Contributing Guide

Please review the CONTRIBUTING.md file for information on how to contribute code and issues to MatX. We require all pull requests to have a linear history and rebase to main before merge.

Comments
  • [FEA] Implementation of MATLAB's find()


    First of all, let me just say that I'm excited to see this project. I'm a long-time user of MATLAB but very new to CUDA, and I love that you are lowering the barrier of entry to GPU programming.

    Is your feature request related to a problem? Please describe. Slicing a matrix is often paired with MATLAB's built-in find function. I think it would be a valuable addition to this project.

    Describe the solution you'd like: Similar to how you have implemented linspace, meshgrid, etc., it would be great to see similar syntax and functionality to the above link.

    Describe alternatives you've considered: I just stumbled upon this library today. I read through all the help docs and didn't see this anywhere. Hopefully I didn't miss it!

    opened by akfite 18
  • [FEA] Tensor contractions


    Great work so far on MatX!

    I wonder if tensor contractions (aka tensordot or einsum) are in the roadmap for MatX. Until now this has existed in cuTENSOR but it is quite verbose, so it would be great to write tensor contractions using MatX high-level syntax.

    enhancement 
    opened by DavidAce 12
  • [QST] Transform Reduce - Kernel Fusion possible?


    Hi, on first sight of this project I had hoped that the deferred executors would allow for fusing transformations into reductions, but after trying it out I'm not so sure anymore:

    int const row = 3;
    int const col = 4;
    auto tensor = matx::make_tensor<int, 2>({row, col});
    tensor.SetVals({{3, 2, 4, 5},
                    {0, -2, 3, 1},
                    {9, 8, 7, 6}});
    tensor.Print(0,0);
    auto count = matx::make_tensor<int, 1>({row});
    
    // Goal: For each row, count the number of elements greater than 2.
    matx::sum(count, tensor > 2);
    cudaDeviceSynchronize();
    count.Print(0);
    

    This code compiles and runs, but the result is not the expected {3, 1, 4}, but {1, 1, 1}. Running the transformation and reduction independently works as expected:

    (tensor = tensor > 2).run();
    matx::sum(count, tensor);
    

    Is or will it be possible to avoid unnecessary reads/writes to tensor by kernel fusion in MatX?

    opened by pauleonix 11
  • Unable to compile sample application


    Describe the bug: Unable to compile a sample application against MatX; the .cu file does nothing but #include <matx.h>

    These are the compile errors:

    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(135): error: namespace "cuda::std::__3::detail" has no member "__atomic_thread_fence_cuda"
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(135): error: namespace "cuda::std::__3::detail" has no member "__thread_scope_system_tag"
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(142): error: namespace "cuda::std::__3::detail" has no member "__atomic_signal_fence_cuda"
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(224): error: namespace "cuda::std::__3::detail" has no member "__atomic_load_n_cuda"
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(224): error: namespace "cuda::std::__3::detail" has no member "__scope_tag"
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(224): error: expected an expression
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(224): error: no instance of overloaded function "cuda::std::__3::__cxx_atomic_alignment_unwrap" matches the argument list
                argument types are: (<error-type>)
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(235): error: namespace "cuda::std::__3::detail" has no member "__atomic_exchange_n_cuda"
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(235): error: namespace "cuda::std::__3::detail" has no member "__scope_tag"
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(235): error: expected an expression
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(235): error: no instance of overloaded function "cuda::std::__3::__cxx_atomic_alignment_unwrap" matches the argument list
                argument types are: (<error-type>)
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(215): error: namespace "cuda::std::__3::detail" has no member "__atomic_store_n_cuda"
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(215): error: namespace "cuda::std::__3::detail" has no member "__scope_tag"
    
    /usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(215): error: expected an expression
    
    

    To Reproduce:

    1. Compile the sample application with CUDA 11.4

    Expected behavior: successful compilation

    Code snippets:

    cmake_minimum_required(VERSION 3.18)
    
    project(SAMPLE_MATX LANGUAGES CUDA CXX)
    find_package(CUDAToolkit 11.4 REQUIRED)
    set(CMAKE_CUDA_ARCHITECTURES 75)
    
    find_package(matx CONFIG REQUIRED)
    
    add_executable(sample_matx main.cu)
    target_link_libraries(sample_matx PRIVATE matx::matx)
    

    System details (please complete the following information):

    • OS: Ubuntu 20.04
    • CUDA version: 11.4
    • g++ version: 9.3

    Additional context: Does the sample application require CUDA 11.5?

    wontfix 
    opened by mormj 11
  • [BUG] einsum is not working


    Describe the bug: matx::cutensor::einsum outputs wrong results.

    To Reproduce:

    1. Create a 2x3 tensor a:
    auto a = matx::make_tensor<float, 2>({2, 3});
    a.SetVals({
        {1, 2, 3},
        {4, 5, 6}
    });
    
    2. Perform an einsum operation to sum-reduce a:
    auto a_reduced = matx::make_tensor<float, 1>({3});
    matx::cutensor::einsum(a_reduced, "ij->j", 0, a);
    
    3. Print the result:
    cudaStreamSynchronize(0);
    a_reduced.Print();
    


    The shape is correct but all values are zeros.

    Expected behavior

    NumPy result (screenshot omitted): the expected output is the column sums {5, 7, 9}.

    System details (please complete the following information):

    • OS: Ubuntu 20.04
    • CUDA version: CUDA 11.4
    • g++ version: 9.3.0
    opened by ZJUGuoShuai 9
  • more flexible memory handling


    1. You chose a managed memory scheme. This is good for entry level, but to consider it for actual systems it would be nice to have some more allocators implemented within the library (or just support the ones from RAPIDS).
    2. In the absence of managed memory it would be nice to have a (Pinned memory) CPU variation of the tensor that handles GPU/CPU copy. Also with allocators...
    3. Vector type support would be nice (uchar4...)

    This sort of basic data structure is very much in need and the lazy execution model is looking compact and useful...

    opened by trporn 9
  • [BUG] Unit Test Failure


    Most of the unit tests pass. I think this one may be towards the end of the list. One test fails on an assertion, and a subsequent test throws an exception.

    After pulling updates from the main branch, I built and ran using:

    cmake -DBUILD_TESTS=ON ..
    make -j
    cd test
    ./matx_test
    

    Output of failures:

    Comparison failed at /path/to/MatX/test/00_transform/MatMul.cu:136:0/64: val=0.438721+13.132812j file=8.766909+-0.536683j (c)
    /path/to/MatX/test/00_transform/MatMul.cu:136: Failure
    Failed
    [  FAILED  ] MatMulTestFloatTypes/6.MediumRect, where TypeParam = matx::matxHalfComplex<matx::matxHalf<__half> > (255 ms)
    [----------] 3 tests from MatMulTestFloatTypes/6 (262 ms total)
    
    [----------] 3 tests from MatMulTestFloatTypes/7, where TypeParam = matx::matxHalfComplex<matx::matxHalf<__nv_bfloat16> >
    [ RUN      ] MatMulTestFloatTypes/7.SmallRect
    matxException (matxMatMulError: ret == CUBLAS_STATUS_SUCCESS) - /path/to/MatX/include/matx_matmul.h:505
    
    Stack Trace:
     ./matx_test() [0x4cd619]
     ./matx_test() [0xd58d45]
     ./matx_test() [0xd4ed50]
     ./matx_test() [0xd40f9a]
     ./matx_test() [0xd3873b]
     ./matx_test : void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x65
     ./matx_test : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x4b
     ./matx_test : testing::Test::Run()+0xea
     ./matx_test : testing::TestInfo::Run()+0x122
     ./matx_test : testing::TestSuite::Run()+0x133
     ./matx_test : testing::internal::UnitTestImpl::RunAllTests()+0x3c1
     ./matx_test : bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x65
     ./matx_test : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x4b
     ./matx_test : testing::UnitTest::Run()+0xaa
     ./matx_test() [0x4a3c99]
     ./matx_test() [0x4a27d6]
     /lib64/libc.so.6 : __libc_start_main()+0xf5
     ./matx_test() [0x4a266e]
    

    System details:

    • GPU: Quadro RTX 5000
    • OS: CentOS 7
    • CUDA version: 11.5
    • gcc/g++ version: 9.3.1 20200408 (Red Hat 9.3.1-2)
    • cmake version: 3.22.1
    opened by seanajohnston 8
  • Giving an error about index_t


    Hi, I was trying this API. I used the example simple_pipeline.cu.

    I tried to compile it with nvcc -I ../include/ ./simple_pipeline.cu. The error I got was something like:

    
    ../include/matx_tensor_ops.h(941): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(947): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(979): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(999): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(1018): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(1037): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(1056): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(1087): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(1138): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(1189): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(1241): error: identifier "index_t" is undefined
    
    ../include/matx_tensor_ops.h(1300): error: identifier "index_t" is undefined
    
    Error limit reached.
    100 errors detected in the compilation of "./simple_pipeline.cu".
    Compilation terminated.
    

    How can I overcome this issue? I have CUDA Toolkit 11.6 and a 3070 Max-Q. Can you provide a script to compile it properly?

    opened by RiShAbHjOsHibot 7
  • [BUG] A simple `matx::sum` test case failed to compile since this commit


    Describe the bug: This simple matx::sum test case fails to compile on the latest commit.

    #include <matx.h>
    
    int main() {
        auto t1 = matx::make_tensor<float, 2>({32, 10});
        auto t2 = matx::make_tensor<float, 1>({32});
    
        matx::sum(t2, t1, 0);
    
        cudaDeviceSynchronize();
    
        return 0;
    }
    

    After binary-searching the commits, I found that this happens starting with commit 77a0d4c6b04fcc8027563127500d4735e99c3cb7.

    To Reproduce:

    1. Try to compile the test code above with MatX after commit 77a0d4c6b04fcc8027563127500d4735e99c3cb7.
    2. Output from compiler:
    Consolidate compiler generated dependencies of target test
    [ 50%] Building CUDA object CMakeFiles/test.dir/main.cu.o
    /root/gs/MatX-Install-Home/include/matx_cub.h(724): error: no instance of function template "cub::DeviceSegmentedReduce::Sum" matches the argument list
                argument types are: (float *, size_t, float *, float *, int, matx::detail::BeginOffset<matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>>, matx::detail::EndOffset<matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>>, const cudaStream_t)
              detected during:
                instantiation of "void matx::detail::matxCubPlan_t<OutputTensor, InputOperator, op, CParams>::ExecSum(OutputTensor &, const InputOperator &, cudaStream_t) [with OutputTensor=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<1>>, InputOperator=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, op=(matx::detail::CUBOperation_t)4, CParams=matx::detail::EmptyParams_t]" 
    (259): here
                instantiation of "matx::detail::matxCubPlan_t<OutputTensor, InputOperator, op, CParams>::matxCubPlan_t(OutputTensor &, const InputOperator &, const CParams &, cudaStream_t) [with OutputTensor=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<1>>, InputOperator=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, op=(matx::detail::CUBOperation_t)4, CParams=matx::detail::EmptyParams_t]" 
    (1249): here
                instantiation of "void matx::cub_sum(OutputTensor &, const InputOperator &, cudaStream_t) [with OutputTensor=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<1>>, InputOperator=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>]" 
    /root/gs/MatX-Install-Home/include/matx_reduce.h(1225): here
                instantiation of "void matx::sum(TensorType &, const InType &, cudaStream_t) [with TensorType=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<1>>, InType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>]" 
    /root/gs/MatX-Test/main.cu(11): here
    
    /root/gs/MatX-Install-Home/include/matx_cub.h(733): error: no instance of function template "cub::DeviceSegmentedReduce::Sum" matches the argument list
                argument types are: (float *, size_t, matx::RandomOperatorIterator<matx::detail::tensor_impl_t<float, 2, matx::DefaultDescriptor<2>>>, float *, int, matx::detail::BeginOffset<matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>>, matx::detail::EndOffset<matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>>, const cudaStream_t)
              detected during:
                instantiation of "void matx::detail::matxCubPlan_t<OutputTensor, InputOperator, op, CParams>::ExecSum(OutputTensor &, const InputOperator &, cudaStream_t) [with OutputTensor=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<1>>, InputOperator=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, op=(matx::detail::CUBOperation_t)4, CParams=matx::detail::EmptyParams_t]" 
    (259): here
                instantiation of "matx::detail::matxCubPlan_t<OutputTensor, InputOperator, op, CParams>::matxCubPlan_t(OutputTensor &, const InputOperator &, const CParams &, cudaStream_t) [with OutputTensor=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<1>>, InputOperator=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, op=(matx::detail::CUBOperation_t)4, CParams=matx::detail::EmptyParams_t]" 
    (1249): here
                instantiation of "void matx::cub_sum(OutputTensor &, const InputOperator &, cudaStream_t) [with OutputTensor=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<1>>, InputOperator=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>]" 
    /root/gs/MatX-Install-Home/include/matx_reduce.h(1225): here
                instantiation of "void matx::sum(TensorType &, const InType &, cudaStream_t) [with TensorType=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<1>>, InType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>]" 
    /root/gs/MatX-Test/main.cu(11): here
    
    2 errors detected in the compilation of "/root/gs/MatX-Test/main.cu".
    make[3]: *** [CMakeFiles/test.dir/build.make:76: CMakeFiles/test.dir/main.cu.o] Error 1
    make[2]: *** [CMakeFiles/Makefile2:83: CMakeFiles/test.dir/all] Error 2
    make[1]: *** [CMakeFiles/Makefile2:90: CMakeFiles/test.dir/rule] Error 2
    make: *** [Makefile:124: test] Error 2
    

    Expected behavior: This test case should compile.

    System details (please complete the following information):

    • OS: Ubuntu 20.04
    • CUDA version: CUDA 11.4
    • g++ version: 9.3.0
    opened by ZJUGuoShuai 6
  • [FEA] Jetson Support


    Thank you for this nice library!

    Is Jetson support on the horizon and if so, what is the timeline for that?

    I am considering starting a project that could build on MatX but it would be vital for the project to eventually support Nvidia Jetsons.

    Thanks a lot Cheers Lukas

    opened by lkskstlr 6
  • [BUG] SetVals failed on tensors that are created from user pointers


    Describe the bug: When using SetVals on tensors created from user pointers, a segmentation fault occurs.

    To Reproduce

    float* dev_float;
    cudaMalloc(&dev_float, sizeof(float) * 6);
    
    auto t = matx::make_tensor<float, 2, matx::non_owning>(dev_float, {2, 3});
    t.SetVals({{1, 2, 3}, {4, 5, 6}});
    t.Print();
    

    VSCode debug error: (screenshot omitted)

    Expected behavior: SetVals should work for tensors created from user pointers.

    System details (please complete the following information):

    • OS: Ubuntu 20.04
    • CUDA version: CUDA 11.4
    • g++ version: 9.3.0
    opened by ZJUGuoShuai 5
  • [FEA] Add support for pwelch()


    pwelch() is commonly used in signal processing for visualizing a spectrum without preprocessing first. This feature request is for preliminary support to match basic use cases of scipy/cusignal's pwelch().

    opened by cliffburdick 0
  • Rework Random Tensors APIs [FEA]


    Random Tensor APIs were some of our first APIs. We have moved away from that style of API and need to update.

    We should deprecate/delete tensorShape_t references. We should get rid of GetTensorView from the generator class.

    We should add APIs in the form below, which inherit rank from the shape.

    Possible APIs:

    auto rand_view = rand(shape, generator, NORMAL);
    auto rand_view = randn(shape, generator);
    auto rand_view = generator.normal(shape);

    I would love feedback on which APIs are best.

    opened by luitjens 0
  • [FEA] Break up unit test TUs


    Unit tests take a very long time to compile on some systems. Parallelism helps, but not as much as it could given that many files have dozens of tests. This task is to break up the tests into more files to increase parallelism.

    opened by cliffburdick 0
  • [BUG] Convolution needs more unit tests; appears broken. 2D needs to support arbitrary batching/ranks.


    Convolution needs more unit tests for both 1D and 2D.

    There are things that look like they are probably bugs but are not currently covered by unit tests.

    For example:

    Throughout the code, the x dim is the last dim (fastest changing):

    if constexpr (d_in.Rank() == 4) {
      bdims[0] = blockIdx.z / d_in.Size(1);
      bdims[1] = blockIdx.z - (bdims[0] * d_in.Size(1));
      ix_size = d_in.Size(3);
      iy_size = d_in.Size(2);
    }
    if constexpr (d_in.Rank() == 3) {
      ix_size = d_in.Size(2);
      iy_size = d_in.Size(1);
    } else if constexpr (d_in.Rank() == 2) {
      ix_size = d_in.Size(1);
      iy_size = d_in.Size(0);
    }

    But then later when checking bounds we do it in reverse:

    if ((threadIdx.x < d_filter.Size(1)) && (threadIdx.y < d_filter.Size(0))) {
      s_filter[d_filter.Size(1) * threadIdx.y + threadIdx.x] =
          d_filter(threadIdx.y, threadIdx.x);
    }

    for (int x = 0; x < d_filter.Size(1); x++) {
      if ((tid_x - static_cast<int>(d_filter.Size(1)) + 1 + x < 0) ||
          (tid_x - static_cast<int>(d_filter.Size(1)) + 1 + x >= ix_size)) {
        continue;
      }

      for (int y = 0; y < d_filter.Size(0); y++) {
        if ((tid_y - static_cast<int>(d_filter.Size(0)) + 1 + y < 0) ||
            (tid_y - static_cast<int>(d_filter.Size(0)) + 1 + y >= iy_size)) {
          continue;
        }

    We should also take this as an opportunity to allow batching with > rank4 tensors and batch the filters too.

    opened by luitjens 0
  • [BUG] Visual studio fails to compile unit tests/examples


    Describe the bug: CMake generates a solution in which 11 of 20 projects fail to compile on VS2022.

    From #147 :

    The problem is not that nvcc is being passed an incorrect flag, but rather that -fvisibility is not valid on VS. We use the option -forward-unknown-to-host-compiler, so nvcc automatically forwards any unknown parameter (of which this is one) to VS.

    To Reproduce:

    cmake -DMATX_BUILD_TESTS=ON -DMATX_BUILD_BENCHMARKS=ON -DMATX_BUILD_EXAMPLES=ON -DMATX_BUILD_DOCS=OFF -DCMAKE_CUDA_ARCHITECTURES=52 -DCMAKE_BUILD_TYPE=Debug ..
    

    Expected behavior: All unit tests and examples compile successfully.

    Code snippets: output log attached.

    System details (please complete the following information):

    • Windows 10 Pro
    • CMake 3.22.1
    • VS2022 (MSVC 19.30.30706.0)
    • CUDA 11.6
    • pybind11 2.6.2
    opened by akfite 3
Releases(v0.2.5)
  • v0.2.4(Mar 31, 2022)

  • v0.2.3(Mar 23, 2022)

    • Improved error messages
    • Added support for einsum function. Includes tensor contractions, GEMMs with transposed outputs, dot products, and trace
    • Integrated cuTENSOR library
    • Added real/imag/r2c operators
    • Added chirp function
    • Added file readers for .mat files
    • Fixes to conv2, fft2
    • Switched to CUB for certain reductions. Results in a 4x speedup in some cases
    • Added find() and find_idx() functions
    • Added unique() function
    • Many CMake fixes to clean up transitive target
    • Added casting operators
    • Added negate operator
  • v0.2.2(Dec 24, 2021)

  • v0.2.1(Dec 17, 2021)

  • v0.2.0(Dec 16, 2021)

    This release adds major changes to the main tensor class to allow custom types for storage and descriptors. In addition, static tensor descriptors are now possible for compile-time pointer arithmetic. As of this release it is no longer recommended to construct tensor_t objects directly; instead, prefer the make_ variants of the functions.
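
    Below is a short sketch of the change. The sizes are illustrative, and it assumes direct construction still compiles through the defaulted storage and descriptor template parameters:

    // No longer recommended: constructing tensor_t directly
    matx::tensor_t<float, 2> t_old({16, 16});

    // Preferred as of v0.2.0: the make_ helpers deduce the storage and descriptor types
    auto t_new = matx::make_tensor<float, 2>({16, 16});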

    Other features of this release are:

    • Refactored tensor class to use generic storage and descriptors

    • Added comments on all make functions; fixed spectrogram examples

    • Added concatenation operator

    • Added static tensors

    • Added const on all operator() where applicable

    • Added more tensor creation functions

    • Changed convolution example to use static tensor sizes

    • Added documentation for the make_ functions

  • v0.1.1(Nov 8, 2021)

    • Added make_tensor helper functions
    • Updated Black-Scholes example
    • Moved host-specific defines into separate file
    • Updated build system to better track libcuda++ and nvbench
    • Improved release mode speed by turning off assertion checking
    • Improved host operator creation time by storing intermediate variables
    • Updated recursive filter example to error if not enough shared memory is available
  • v0.1.0(Oct 26, 2021)

    First public release of MatX. A brief list of supported features:

    • Frontend API for cuBLAS, CUTLASS, cuFFT, cuSolver, cuRAND, and CUB
    • All standard POD data types supported, as well as fp16/bf16 and complex
    • Template expression trees to generate optimized device kernels
    • Examples for both performance and accuracy
    • Over 500 unit tests
    • Benchmarks using nvbench
    • Native CMake build system
    • and more!
Owner
NVIDIA Corporation