a language for fast, portable data-parallel computation



Halide is a programming language designed to make it easier to write high-performance image and array processing code on modern machines. Halide currently targets:

  • CPU architectures: X86, ARM, MIPS, Hexagon, PowerPC
  • Operating systems: Linux, Windows, Mac OS X, Android, iOS, Qualcomm QuRT
  • GPU Compute APIs: CUDA, OpenCL, OpenGL Compute Shaders, Apple Metal, Microsoft Direct X 12

Rather than being a standalone programming language, Halide is embedded in C++. This means you write C++ code that builds an in-memory representation of a Halide pipeline using Halide's C++ API. You can then compile this representation to an object file, or JIT-compile it and run it in the same process. Halide also has Python bindings that fully support writing Halide embedded in Python, without C++.

For more detail about what Halide is, see http://halide-lang.org.

For API documentation see http://halide-lang.org/docs

To see some example code, look in the tutorials directory.

If you've acquired a full source distribution and want to build Halide, see the notes below.

Getting Halide

Binary tarballs

The latest version of Halide is Halide 11.0.1. We provide binary releases for many popular platforms and architectures, including 32/64-bit x86 Windows, 64-bit macOS, and 32/64-bit x86/ARM Ubuntu Linux. See the releases page of the Halide repository.


If you use vcpkg to manage dependencies, you can install Halide via:

$ vcpkg install halide:x64-windows # or x64-linux/x64-osx

Note two caveats: first, at time of writing, MSVC mis-compiles LLVM on x86-windows, so Halide cannot be used in vcpkg on that platform at this time; second, vcpkg installs only the minimum Halide backends required to compile code for the active platform. If you want to include all the backends, you should install halide[target-all]:x64-windows instead. Note that since this will build LLVM, it will take a lot of disk space (up to 100GB).


Alternatively, if you use macOS, you can install Halide via Homebrew like so:

$ brew install halide

Other package managers

We are interested in bringing Halide to other popular package managers and Linux distribution repositories including, but not limited to, Conan, Debian, Ubuntu (or PPA), CentOS/Fedora, and Arch. If you have experience publishing packages, we would be happy to work with you!

If you are a maintainer of any other package distribution platform, we would be excited to work with you, too.

Building Halide with Make


Have llvm-10.0 (or greater) installed and run make in the root directory of the repository (where this README is).

Acquiring LLVM

At any point in time, building Halide requires either the latest stable version of LLVM, the previous stable version of LLVM, or trunk. At the time of writing, this means versions 11.0 and 10.0 are supported, but 9.0 is not. The commands llvm-config and clang must be somewhere in the path.
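The support rule above can be checked before starting a build. The following is a minimal sketch, assuming that llvm-config --version prints a string such as 11.0.0 and that trunk builds report a version ending in "git"; the function names and the SUPPORTED_MAJORS tuple are hypothetical helpers, not part of Halide.

```python
import re
import subprocess

# Assumption: mirrors the statement that 11.0 and 10.0 are supported at the
# time of writing. Update this tuple when the supported window moves.
SUPPORTED_MAJORS = (10, 11)


def is_supported(version_text, supported=SUPPORTED_MAJORS):
    """Return True if an LLVM version string falls inside the supported set."""
    text = version_text.strip()
    # Assumption: trunk builds report versions such as "12.0.0git".
    if text.endswith("git"):
        return True
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized LLVM version: {version_text!r}")
    return int(match.group(1)) in supported


def installed_llvm_version(llvm_config="llvm-config"):
    """Ask llvm-config (which must be on the path) for its version string."""
    result = subprocess.run(
        [llvm_config, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

With LLVM installed, is_supported(installed_llvm_version()) gives a quick yes/no answer before running make.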

If your OS does not have packages for llvm, you can find binaries for it at http://llvm.org/releases/download.html. Download an appropriate package and then either install it, or at least put the bin subdirectory in your path. (This works well on OS X and Ubuntu.)

If you want to build it yourself, first check it out from GitHub:

% git clone --depth 1 --branch llvmorg-11.0.0 https://github.com/llvm/llvm-project.git

(If you want to build LLVM 10.x, use branch llvmorg-10.0.1; for current trunk, use main)

Then build it like so:

% cmake -DCMAKE_BUILD_TYPE=Release \
        -DLLVM_ENABLE_PROJECTS="clang;lld;clang-tools-extra" \
        -DLLVM_TARGETS_TO_BUILD="X86;ARM;NVPTX;AArch64;Mips;Hexagon" \
        -S llvm-project/llvm -B llvm-build
% cmake --build llvm-build
% cmake --install llvm-build --prefix llvm-install

then to point Halide to it:

% export LLVM_ROOT=$PWD/llvm-install
% export LLVM_CONFIG=$LLVM_ROOT/bin/llvm-config

Note that you must add clang to LLVM_ENABLE_PROJECTS; adding lld to LLVM_ENABLE_PROJECTS is only required when using WebAssembly, and adding clang-tools-extra is only necessary if you plan to contribute code to Halide (so that you can run clang-tidy on your pull requests). We recommend enabling both in all cases, to simplify builds. You can disable exception handling (EH) and RTTI if you don't want the Python bindings.

Building Halide with make

With LLVM_CONFIG set (or llvm-config in your path), you should be able to just run make in the root directory of the Halide source tree. make run_tests will run the JIT test suite, and make test_apps will make sure all the apps compile and run (but won't check their output).

There is no make install yet. If you want to make an install package, run make distrib.

Building Halide out-of-tree with make

If you wish to build Halide in a separate directory, you can do that like so:

% cd ..
% mkdir halide_build
% cd halide_build
% make -f ../Halide/Makefile

Building Halide with CMake

MacOS and Linux

Follow the above instructions to build LLVM or acquire a suitable binary release. Then change directory to the Halide repository and run:

% cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_DIR=$LLVM_ROOT/lib/cmake/llvm -S . -B build
% cmake --build build

LLVM_DIR is the folder in the LLVM installation tree (do not use the build tree by mistake) that contains LLVMConfig.cmake. It is not required to set this variable if you have a suitable system-wide version installed. If you have multiple system-wide versions installed, you can specify the version with Halide_REQUIRE_LLVM_VERSION. Add -G Ninja if you prefer to build with the Ninja generator.
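Pointing LLVM_DIR at the wrong folder is an easy mistake, so a small check can help. This is a sketch only: the function name is hypothetical, and it tests nothing beyond the presence of LLVMConfig.cmake in the given folder.

```python
from pathlib import Path


def looks_like_llvm_dir(path):
    """Return True if `path` directly contains LLVMConfig.cmake."""
    return (Path(path) / "LLVMConfig.cmake").is_file()
```

For the install tree built earlier, the folder to test would be $LLVM_ROOT/lib/cmake/llvm.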


Windows

We suggest building with Visual Studio 2019. Your mileage may vary with earlier versions. Be sure to install the "C++ CMake tools for Windows" in the Visual Studio installer. For older versions of Visual Studio, do not install the CMake tools, but instead acquire CMake and Ninja from their respective project websites.

These instructions start from the D: drive. We assume this git repo is cloned to D:\Halide. We also assume that your shell environment is set up correctly. For a 64-bit build, run:

D:\> "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x64

For a 32-bit build, run:

D:\> "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x64_x86

Managing dependencies with vcpkg

The best way to get compatible dependencies on Windows is to use vcpkg. Install it like so:

D:\> git clone https://github.com/Microsoft/vcpkg.git
D:\> cd vcpkg
D:\vcpkg> .\bootstrap-vcpkg.bat
D:\vcpkg> .\vcpkg integrate install
CMake projects should use: "-DCMAKE_TOOLCHAIN_FILE=D:/vcpkg/scripts/buildsystems/vcpkg.cmake"

Then install the libraries. For a 64-bit build, run:

D:\vcpkg> .\vcpkg install libpng:x64-windows libjpeg-turbo:x64-windows llvm[target-all,clang-tools-extra]:x64-windows

To support 32-bit builds, also run:

D:\vcpkg> .\vcpkg install libpng:x86-windows libjpeg-turbo:x86-windows llvm[target-all,clang-tools-extra]:x86-windows

Building Halide

Create a separate build tree and call CMake with vcpkg's toolchain. This will build in either 32-bit or 64-bit depending on the environment script (vcvars) that was run earlier.

D:\Halide> cmake -G Ninja ^
                 -DCMAKE_BUILD_TYPE=Release ^
                 -DCMAKE_TOOLCHAIN_FILE=D:/vcpkg/scripts/buildsystems/vcpkg.cmake ^
                 -S . -B build

Note: If building with Python bindings on 32-bit (enabled by default), be sure to point CMake to the installation path of a 32-bit Python 3. You can do this by specifying, for example: "-DPython3_ROOT_DIR=C:\Program Files (x86)\Python38-32".

Then run the build with:

D:\Halide> cmake --build build --config Release -j %NUMBER_OF_PROCESSORS%

To run all the tests:

D:\Halide> cd build
D:\Halide\build> ctest -C Release

Subsets of the tests can be selected with -L and include correctness, python, error, and the other directory names under /tests.

Building LLVM (optional)

Follow these steps if you want to build LLVM yourself. First, download LLVM's sources (these instructions use the latest 11.0 release):

D:\> git clone --depth 1 --branch llvmorg-11.0.0 https://github.com/llvm/llvm-project.git

For a 64-bit build, run:

D:\> cmake -G Ninja ^
           -DCMAKE_BUILD_TYPE=Release ^
           -DLLVM_ENABLE_PROJECTS=clang;lld;clang-tools-extra ^
           -DLLVM_TARGETS_TO_BUILD=X86;ARM;NVPTX;AArch64;Mips;Hexagon ^
           -DLLVM_ENABLE_EH=ON ^
           -DLLVM_ENABLE_RTTI=ON ^
           -DLLVM_BUILD_32_BITS=OFF ^
           -S llvm-project\llvm -B llvm-build

For a 32-bit build, run:

D:\> cmake -G Ninja ^
           -DCMAKE_BUILD_TYPE=Release ^
           -DLLVM_ENABLE_PROJECTS=clang;lld;clang-tools-extra ^
           -DLLVM_TARGETS_TO_BUILD=X86;ARM;NVPTX;AArch64;Mips;Hexagon ^
           -DLLVM_ENABLE_EH=ON ^
           -DLLVM_ENABLE_RTTI=ON ^
           -DLLVM_BUILD_32_BITS=ON ^
           -S llvm-project\llvm -B llvm32-build

Finally, run the following (for the 32-bit configuration above, substitute llvm32-build for llvm-build):

D:\> cmake --build llvm-build --config Release -j %NUMBER_OF_PROCESSORS%
D:\> cmake --install llvm-build --prefix llvm-install

You can substitute Debug for Release in the above cmake commands if you want a debug build. Make sure to add -DLLVM_DIR=D:/llvm-install/lib/cmake/llvm to the Halide CMake command to override vcpkg's LLVM.

MSBuild: If you want to build LLVM with MSBuild instead of Ninja, use -G "Visual Studio 16 2019" -Thost=x64 -A x64 or -G "Visual Studio 16 2019" -Thost=x64 -A Win32 in place of -G Ninja.

If all else fails...

Do what the build-bots do: https://buildbot.halide-lang.org/master/#/builders

If the column that best matches your system is red, then maybe things aren't just broken for you. If it's green, then you can click the "stdio" links in the latest build to see what commands the build bots run, and what the output was.

Some useful environment variables

HL_TARGET=... will set Halide's AOT compilation target.

HL_JIT_TARGET=... will set Halide's JIT compilation target.

HL_DEBUG_CODEGEN=1 will print out pseudocode for what Halide is compiling. Higher numbers will print more detail.

HL_NUM_THREADS=... specifies the number of threads to create for the thread pool. When the async scheduling directive is used, more threads than this number may be required and thus allocated. A maximum of 256 threads is allowed. (By default, the number of cores on the host is used.)

HL_TRACE_FILE=... specifies a binary target file to dump tracing data into (ignored unless at least one trace_ feature is enabled in HL_TARGET or HL_JIT_TARGET). The output can be parsed programmatically by starting from the code in utils/HalideTraceViz.cpp.
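When a pipeline is launched from a script, these variables can be set for the child process only. The sketch below is one way to do that; halide_env is a hypothetical helper, and the 256-thread limit is taken from the description of HL_NUM_THREADS above.

```python
import os

MAX_HALIDE_THREADS = 256  # upper limit stated for HL_NUM_THREADS


def halide_env(jit_target=None, debug_level=None, num_threads=None, base=None):
    """Return a copy of the environment with the requested HL_* variables set."""
    env = dict(os.environ if base is None else base)
    if jit_target is not None:
        env["HL_JIT_TARGET"] = jit_target
    if debug_level is not None:
        env["HL_DEBUG_CODEGEN"] = str(debug_level)
    if num_threads is not None:
        if not 1 <= num_threads <= MAX_HALIDE_THREADS:
            raise ValueError(
                f"HL_NUM_THREADS must be between 1 and {MAX_HALIDE_THREADS}"
            )
        env["HL_NUM_THREADS"] = str(num_threads)
    return env
```

The result can be passed as the env argument of subprocess.run when starting a program that JIT-compiles a Halide pipeline.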

Using Halide on OSX

Precompiled Halide distributions are built using Xcode's command-line tools with Apple clang 500.2.76. This means that we link against libc++ instead of libstdc++. You may need to adjust compiler options accordingly if you're using an older Xcode which does not default to libc++.

Halide OpenGL/GLSL backend

TODO(https://github.com/halide/Halide/issues/5633): update this for OpenGLCompute, which is staying

Halide for Hexagon HVX

Halide supports offloading work to the Qualcomm Hexagon DSP on Qualcomm Snapdragon 835 devices or newer. The Hexagon DSP provides a set of 128-byte vector instruction extensions, the Hexagon Vector eXtensions (HVX). HVX is well suited to image processing, and Halide for Hexagon HVX will generate the appropriate HVX vector instructions from a program authored in Halide.

Halide can be used to compile Hexagon object files directly, by using a target such as hexagon-32-qurt-hvx.

Halide can also be used to offload parts of a pipeline to Hexagon using the hexagon scheduling directive. To enable the hexagon scheduling directive, include the hvx target feature in your target. The currently supported combination of targets is to use the HVX target features with an x86 linux host (to use the simulator) or with an ARM android target (to use Hexagon DSP hardware). For examples of using the hexagon scheduling directive on both the simulator and a Hexagon DSP, see the blur example app.
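Target strings such as hexagon-32-qurt-hvx and arm-64-android-hvx are dash-separated lists of tokens. As a rough illustration of how they break down, the sketch below sorts tokens into architecture, bit width, operating system, and features. It is not Halide's own parser: the token sets contain only names that appear in this README, and the function name is hypothetical.

```python
# Assumption: deliberately incomplete token sets, limited to names seen in
# this README. "host" is a placeholder that Halide expands itself.
ARCHS = {"host", "x86", "arm", "hexagon"}
OSES = {"linux", "android", "qurt"}
BITS = {"32", "64"}


def split_target(target):
    """Classify the dash-separated tokens of a target string."""
    parts = {"arch": None, "bits": None, "os": None, "features": []}
    for token in target.split("-"):
        if token in ARCHS and parts["arch"] is None:
            parts["arch"] = token
        elif token in BITS and parts["bits"] is None:
            parts["bits"] = int(token)
        elif token in OSES and parts["os"] is None:
            parts["os"] = token
        else:
            # Anything unrecognized is treated as a target feature.
            parts["features"].append(token)
    return parts
```

Under this model, enabling the hexagon scheduling directive corresponds to "hvx" appearing in the features list.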

To build and run an example app using the Hexagon target,

  1. Obtain and build trunk LLVM and Clang. (Earlier versions of LLVM may work but are not actively tested and thus not recommended.)
  2. Download and install the Hexagon SDK and Hexagon Tools. Hexagon SDK 3.4.1 or later is needed. Hexagon Tools 8.2 or later is needed.
  3. Build and run an example for Hexagon HVX

1. Obtain and build trunk LLVM and Clang

(Instructions were given previously; just be sure to check out the main branch.)

2. Download and install the Hexagon SDK and Hexagon Tools

Go to https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools

  1. Select the Hexagon Series 600 Software and download the 3.4.1 version or later for Linux.
  2. Untar the installer
  3. Run the extracted installer to install the Hexagon SDK and Hexagon Tools, selecting Installation of Hexagon SDK into /location/of/SDK/Hexagon_SDK/3.x and the Hexagon tools into /location/of/SDK/Hexagon_Tools/8.x
  4. Set an environment variable to point to the SDK installation location
    export SDK_LOC=/location/of/SDK

3. Build and run an example for Hexagon HVX

In addition to running Hexagon code on device, Halide also supports running Hexagon code on the simulator from the Hexagon tools.

To build and run the blur example in Halide/apps/blur on the simulator:

cd apps/blur
export HL_HEXAGON_SIM_REMOTE=../../src/runtime/hexagon_remote/bin/v62/hexagon_sim_remote
export HL_HEXAGON_TOOLS=$SDK_LOC/Hexagon_Tools/8.x/Tools/
LD_LIBRARY_PATH=../../src/runtime/hexagon_remote/bin/host/:$HL_HEXAGON_TOOLS/lib/iss/:. HL_TARGET=host-hvx make test

To build and run the blur example in Halide/apps/blur on Android:

To build the example for Android, first ensure that you have Android NDK r19b or later installed, and the ANDROID_NDK_ROOT environment variable points to it. (Note that Qualcomm Hexagon SDK v3.5.2 includes Android NDK r19c, which is fine.)

Now build and run the blur example using the script to run it on device:

HL_TARGET=arm-64-android-hvx ./adb_run_on_device.sh
  • Metaprogrammed simplifier rules

    I'm finding it very hard to write lots of new simplifier rules correctly. I want to make adding new rules more scalable in terms of programmer effort, simplifier stack usage, and simplifier runtime (in that order).

    This PR is a proof-of-concept of template meta-programming the simplifier rules, which improves all three factors. See the changes to Simplify::visit(Select) at the very bottom.


    Pros:

    • Much easier to read and add new rules
    • Uses way less stack space (216 bytes instead of 1126 bytes for the Select visitor stack frame)
    • Slightly faster, mostly due to the bespoke IRMatcher::equal
    • There's an opportunity to statically check properties of each rewrite rule (e.g. some measure of complexity decreases, implying there can be no loops).


    Cons:

    • If you do something wrong, the error messages are the usual template metaprogramming hell
    • The implementation is super ugly. Every time you see me doing something weird in IRMatch.h it's probably avoiding null checks, avoiding atomic increments or decrements on Exprs, or it's saving stack space.
    • Will be awkward to express some complex constraints in these rules. I'll probably have to extend the template classes a bit to cover things like: this and that must both be constants and one has to be larger than the other. I have some ideas for how to do this.
    opened by abadams 144
  • Move add_image_checks after bounds inference (See #3036)

    This seemingly minor change could have large consequences. I'll watch the buildbots. This would also need checking on Google pipelines.

    opened by abadams 74
  • Specialize branched loops

    Continuation of PR #469 inside the main repo.

    opened by abadams 53
  • Error running HelloHexagon

    Following the instructions in the README to install the Hexagon SDK, I installed LLVM 4.0 and recompiled the Halide sources against LLVM 4.0 as well. The Hexagon SDK installer I got from Qualcomm provides only a "HEXAGON_Tools/7.2.12" directory, which differs from the "Hexagon_Tools/8.0" described in the README. I searched hard and could not find Hexagon Tools 8.0 anywhere.

    Now if I adjust HL_HEXAGON_TOOLS accordingly to make things compile, I run into this error:

    HL_TARGET=host ./pipeline pipeline_cpu-host pipeline_cpu
    Target: x86-64-linux-avx-avx2-f16c-fma-sse41
    HL_TARGET=host-hvx_64 ./pipeline pipeline_hvx64-host pipeline_hvx64
    Target: x86-64-linux-avx-avx2-f16c-fma-hvx_64-sse41
    warning: unknown warning option '-Wno-override-module' [-Wunknown-warning-option]
    /tmp/hexrGI6K9.ll:2:1: error: expected top-level entity
    source_filename = "/usr/local/google/home/wilwong/halide/Halide-20160812/src/runtime/noos.cpp"
    1 warning and 1 error generated.
    Internal error at /usr/local/google/home/wilwong/halide/Halide-20160812/src/HexagonOffload.cpp:342 triggered by user code at ./pipeline.cpp:91:
    Condition failed: result == 0
    hexagon-clang failed
    make: *** [pipeline_hvx64-host.o] Aborted (core dumped)
    rm pipeline_cpu-host.o

    Digging deeper, HexagonOffload.cpp apparently creates a temporary file and passes it to hexagon-clang, but the temporary file does not look like anything I am familiar with (it is not C/C++).

    Does anybody know what is going wrong? Does it have to do with the wrong version of Hexagon Tools?


    opened by wiltswong 51
  • Add Generator class and support code

    Generator is intended to be the preferred way to encapsulate Func building in user pipelines. This pull request probably could use more documentation, but the overall design and set of tests is solid enough to begin serious discussion.

    opened by steven-johnson 48
  • CMake build system fixes

    This PR contains several fixes to CMake build system. Note build of apps is disabled because building them is still broken.

    The main highlights are:

    • A more sensible way of detecting and using LLVM (using its CMake Config file)
    • Support for building Doxygen.
    • Tutorials now build.

    There are still lots of things that need to be done to bring the CMake build system to parity with the Makefile build system but hopefully we'll be able to remove it eventually because I don't think maintaining two build systems is a good idea.

    opened by delcypher 47
  • New Boost.Python interface

    This branch is now feature complete, see bottom of the thread for updates/discussions.

    Following the spirit of other repositories this is a pull request of an ongoing branch. The code is not ready for merge, but this pull request enables to have an ongoing discussion to guide the development.

    The rationale for this new branch is explained in readme.text: basically, the current Python bindings rely on a broken tool (swig), have too much spaghetti (see __init__.py), and are not linked closely enough to the C++ codebase.

    The current status is: a) Proof of concept of the approach is validated (see d), b) ~~70%~~ ~~80%~~ ~~90%~~ all of the ground work is in place (~~most notable missing pieces is the gpu API, Tuple, RDom~~), c) Code compiles and runs, d) blur.py runs and blurs the image as desired (erode.py, bilateral_grid.py also works). ~~e) No real unit-tests in place (see Q4).~~

    On principle code should work perfectly fine on python 2 and 3; but I have only tested for python 3.

    Some of feedback I would like to have: Q1) Likelihood of a merge with master once API/tests coverage is good enough.

    Q2) Suggested strategy to better address the drift between Python bindings and C++ code base (i.e. how to include in continuous integration).

    Q3) Current code adds dependency with Boost.Numpy (which is not part of boost) for convenient I/O. I am considering including a copy into the repository, opinions on the cleanest way of handling the dependency?

    Q4) Suggestions for the best testing approach. For now I will focus on porting the demonstration apps from the old python bindings, and covering the areas I will be using; but I guess we could do better.

    Q5) Anyone interested on giving a hand ? Second brain helps make code cleaner. Also, I would not mind delegating the gpu API part (which I am not planning to use in the short term).

    opened by rodrigob 46
  • ABI issue with LLVM11 and D3D12 (__stdcall)

    The following code snippet from d3d12compute.cpp:

    hCPU.ptr += i * descriptorSize;
    (*device)->CreateUnorderedAccessView(NULL, NULL, &NullDescUAV, hCPU);

    produces the following assembly

    LLVM 10                                               LLVM 11
       0x2c619a06bf1:       mov    %r14,-0x1d0(%rbp)         0x19f89f06c01:       mov    %r14,-0x1e0(%rbp)
       ...                                                   ...
       0x2c619a06c09:       mov    %rcx,-0x220(%rbp)         0x19f89f06c19:       mov    %rcx,-0x1d0(%rbp)    <------
                                                             0x19f89f06c20:       lea    -0x1d0(%rbp),%rcx
       0x2c619a06c10:       mov    %rcx,0x20(%rsp)           0x19f89f06c27:       mov    %rcx,0x20(%rsp)
       ...                                                   ...
                            callq  *%rax                                          callq  *%rax

    On the left, the value of hCPU.ptr is in %rcx, which is then placed on the stack (0x20(%rsp)) to serve as an argument to the subsequent callq to CreateUnorderedAccessView. This is the correct ABI behavior.

    On the right, LLVM11 decided to replace the value of %rcx with the address of that value (notice the lea instruction). This is basically turning something that should have been passed by-value into something that is passed by-reference.

    In short, LLVM 11 is insisting on passing the struct by reference, even though it fits entirely in a 64-bit word and should have been passed by-value:

    typedef struct D3D12_CPU_DESCRIPTOR_HANDLE {
        SIZE_T ptr;
    opened by slomp 45
  • added initial source code support for AMDGPU backend

    Initial source port.

    opened by adityaatluri 41
  • Update CMake to use modern features in testing.

    This is something of a big PR, but it only touches the CMake build, with one notable exception: apps/support/cmdline.h has been modified to (crudely) support non-RTTI builds.

    The main contribution is this: pseudo-targets for running tests have been removed in favor of using CTest, which has a number of advantages. Notably, it has a native notion of test labels, which allow us to define and select groups of tests to run. It also allows developers to run only those tests which failed previously; this is especially useful after an incremental build.

    The CMake minimum version requirement has been bumped from 3.3 to 3.14. A lot has changed in CMake in recent years and an initial effort has been made to modernize. For instance, the halide_use_image_io function has been replaced with an alias target Halide::ImageIO to which a normal target can link. A subsequent PR will focus on modernizing the rest of the CMake build so we can be more easily integrated into other projects, packaged by popular package managers, and installed into standard system locations.

    A large number of new lines list source files. Globbing source files in CMake breaks incremental builds and leads to frustrating scenarios where the lack of changes to CMakeLists after a pull requires a developer to do a full rebuild. The CMake developers strongly caution against source file globbing. It is the prevailing opinion on StackOverflow and in talks from the maintainers.

    Expect build failures while I learn how Travis works. Will also need to ask about the buildbots.

    opened by alexreinking 39
  • Minor cleanup of parallel refactor intrinsics

    • Renamed load_struct_member to load_typed_struct_member to make it more clear that it is intended for use only with the results of make_typed_struct.
    • Split declare_struct_type into two intrinsics, define_typed_struct and forward_declare_typed_struct, removing the need for the underdocumented mode argument and hopefully making usage clearer
    • Added / clarified comments for the intrinsics modified above
    opened by steven-johnson 0
  • We should warn on a bound + compute_at combination that makes no sense

    These are easy mistakes to make, and they blow up the computational complexity of the algorithm:

    Func f, g;
    f(x) = x;
    g(x) = f(x);
    f.bound(x, 0, width); // Added while everything was compute_root
    g.bound(x, 0, width);
    f.compute_at(g, x); // Added later, but oops, this makes the bound silly
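
To make the cost concrete, here is a plain-Python model of the two schedules. It is a sketch of one reading of the snippet above (the bound forcing f to be computed over the whole range on every iteration of g's x loop), not Halide's actual lowering; the function and schedule names are hypothetical.

```python
def run(width, schedule):
    """Evaluate g(x) = f(x) under a modeled schedule and count calls to f."""
    calls = 0

    def f(x):
        nonlocal calls
        calls += 1
        return x

    out = []
    if schedule == "compute_root":
        # f is computed once over [0, width), then g reads from the buffer.
        f_buf = [f(x) for x in range(width)]
        out = [f_buf[x] for x in range(width)]
    elif schedule == "compute_at_bounded":
        # f is recomputed inside every iteration of g's x loop, and the
        # bound makes each recomputation cover the whole range [0, width).
        for x in range(width):
            f_buf = [f(i) for i in range(width)]
            out.append(f_buf[x])
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return out, calls
```

In this model both schedules produce the same output, but the number of evaluations of f grows from width to width squared.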
    opened by abadams 2
  • Can't use in() on generator outputs.

    I would like to be able to write the following, where output is an Output<Buffer<>> member of a generator:

    output(x, y) += foo;
    output.compute_at(output.in(), y)

    This would need some sort of checking if a pipeline output has a global wrapper in early lowering, and if so, promoting that wrapper to be the actual output.

    opened by abadams 6
  • Does halide support RVar async?

    I want to run computation and memory writes concurrently:

    x_idx(x, y) = calculate1(input, x, y);
    y_idx(x, y) = calculate2(input, x, y);

    Using Halide, calculating x_idx and y_idx takes 4 ms.

    for (int i = 0; i < input.height(); i++)
        for (int j = 0; j < input.width(); j++)
            output(x_idx(i, j), y_idx(i, j)) = input(i, j);

    Using C code, this takes 10 ms and cannot be multithreaded. The equivalent Halide code is below:

    RDom r(0, input.width(), 0, input.height());
    output(x_idx(r.x, r.y), y_idx(r.x, r.y)) = input(r.x, r.y);

    To write memory as soon as 4 pixels have been computed (NEON, 128-bit), can I schedule an RVar using async, like this: x_idx.store_root().compute_at(output, x).async()? I look forward to your reply. Thanks!

    opened by mym2009 0
  • We should unconditionally use fast_integer_divide for vector division by uint8s

    According to the performance test it's unconditionally faster for vectors, by at least 5x. We could pattern match a / cast(..., some_uint8) where a is a vector type in the find_intrinsics pass.
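
For readers unfamiliar with why such a division can be fast, the sketch below shows the general multiply-and-shift idea for 8-bit unsigned values: division by d is replaced by a multiplication with a precomputed constant followed by a right shift. This illustrates the technique only and is not Halide's fast_integer_divide implementation; the function names and the choice of a 16-bit shift are assumptions made for this example.

```python
def magic_for_divisor(d):
    """Return ceil(2**16 / d) for an 8-bit unsigned divisor d in [1, 255]."""
    if not 1 <= d <= 255:
        raise ValueError("divisor must be in [1, 255]")
    return (65536 + d - 1) // d


def fast_div_u8(x, d):
    """Divide an 8-bit unsigned x by d using one multiply and one shift."""
    if not 0 <= x <= 255:
        raise ValueError("numerator must be in [0, 255]")
    return (x * magic_for_divisor(d)) >> 16
```

The constant depends only on the divisor, so it can be computed once per divisor (or looked up in a table) and applied to a whole vector of numerators.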

    division rounding to negative infinity:
    type            const-divisor speed-up  runtime-divisor speed-up
     Int(32,  1)     2.662                   1.101
     Int(16,  1)     3.022                   1.663
     Int( 8,  1)     1.408                   1.080
    UInt(32,  1)     2.570                   1.706
    UInt(16,  1)     3.068                   1.450
    UInt( 8,  1)     2.987                   1.456
     Int(32,  8)    10.722                   7.991
     Int(16, 16)    46.577                  30.900
     Int( 8, 32)    25.602                   8.292
    UInt(32,  8)     8.115                   5.423
    UInt(16, 16)    24.296                  13.680
    UInt( 8, 32)    42.669                  19.993
    signed division rounding to zero:
    type            const-divisor speed-up  runtime-divisor speed-up
     Int(32,  1)     2.402                   1.155
     Int(16,  1)     2.537                   1.453
     Int( 8,  1)     1.774                   0.680
     Int(32,  8)     8.517                   5.975
     Int(16, 16)    52.965                  38.595
     Int( 8, 32)    19.745                   8.318
    type            const-divisor speed-up  runtime-divisor speed-up
     Int(32,  1)     2.394                   1.143
     Int(16,  1)     2.536                   1.503
     Int( 8,  1)     1.755                   0.671
    UInt(32,  1)     2.279                   1.690
    UInt(16,  1)     2.659                   1.594
    UInt( 8,  1)     2.567                   1.212
     Int(32,  8)     8.296                   5.696
     Int(16, 16)    53.311                  32.092
     Int( 8, 32)    19.439                   8.173
    UInt(32,  8)     6.009                   5.103
    UInt(16, 16)    19.090                  12.386
    UInt( 8, 32)    22.973                  15.043
    opened by abadams 2
  • Add a fast integer divide that rounds to zero

    While working on legacy code I discovered a need for this. Performance test shows a good speed-up over native division for vector code:

    signed division rounding to zero:
    type            const-divisor speed-up  runtime-divisor speed-up
     Int(32,  1)     2.416                   1.153
     Int(16,  1)     2.552                   1.457
     Int( 8,  1)     1.782                   0.667
     Int(32,  8)     8.592                   5.908
     Int(16, 16)    53.008                  38.505
     Int( 8, 32)    19.480                   8.197
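
The two rounding modes named in these tables differ only when the operands have opposite signs. A minimal sketch of both, with hypothetical function names, for nonzero divisors:

```python
def div_round_to_negative_infinity(a, b):
    """Integer division that rounds the quotient toward negative infinity."""
    return a // b


def div_round_to_zero(a, b):
    """Integer division that rounds the quotient toward zero (C-style)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q
```

For example, dividing -7 by 2 gives -4 when rounding to negative infinity and -3 when rounding to zero.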
    opened by abadams 2
  • Documentation for .in() refers to interpolate app, but it has been rewritten since.

    The app/interpolate no longer uses the .in() directive. A new app should be chosen to guide the reader to a useful example. https://github.com/halide/Halide/blob/c0192ffa71bbebfbdcb6eddcdf060169f5022ea2/src/Func.h#L1313-L1316

    While we are at .in() (again with FAQs efforts in mind), I'd like to also hear about the technique of copying memory into a SM's shared memory for improved performance. There is a trick in the apps somewhere that uses .in().in() to achieve this. I think this needs extensive elaboration: https://github.com/halide/Halide/blob/c0192ffa71bbebfbdcb6eddcdf060169f5022ea2/apps/stencil_chain/stencil_chain_generator.cpp#L86-L101

    I'm slowly getting the hang of what .in() does, but this I don't get. It seems that the first block is meant to copy it to block Shared Memory, and then the second one (the one embedded in code here) is meant to load it into registers? Maybe I'm not familiar with how CUDA works, but how can a function be loaded into registers? Every value goes into a register? Why do you know this in this case? Doesn't there need to be a .store_in(MemoryType::Register) then? Same for the loading in the shared memory: doesn't it need a .store_in(MemoryType::GPUShared)?

    opened by mcourteaux 0
  • Add copysign intrinsic

    These should exist for integers and floats. For integers x86 has instructions to copy the sign over from one vector of ints to another, but we have no way to target them. They are exposed by llvm as intrinsics.

    One weirdness is that the integer versions have odd behavior when the arg is zero. I think we'd prefer to treat zero as positive.
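
As a sketch of the difference, the first function below models the preferred semantics (zero treated as positive), and the second models my understanding of the x86 PSIGN family, which negates on a negative argument and yields zero on a zero argument. Both names are hypothetical, and neither is an existing Halide intrinsic.

```python
def int_copysign(magnitude, sign_source):
    """Return |magnitude| with the sign of sign_source; zero counts as positive."""
    m = abs(magnitude)
    return -m if sign_source < 0 else m


def psign_like(value, sign_source):
    """Model of PSIGN-style behavior (assumption): negate, zero, or pass through."""
    if sign_source < 0:
        return -value
    if sign_source == 0:
        return 0
    return value
```

The two disagree when the sign argument is zero, and also when the first operand is already negative, since the second function negates conditionally rather than copying a sign.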

    opened by abadams 2
  • Parsing the target string "host-cuda" creates a JITModule and a cuda context, which is surprising

    It does this because it needs a context (and a GPU device selected) to know what cuda capabilities the host system has in order to resolve it to something like x86-64-avx2-linux-cuda-cuda_capability_31

    One way we could make this less surprising is delaying this resolution until the first time we actually need to know the cuda version, likely in lowering. We'd parse host-cuda into x86-64-avx2-linux-cuda-cuda_capability_host, and then at some point host would get converted into a concrete cuda capability. This is weird for different reasons though - you have this intermediate target. On the other hand, it would let you specify x86-64-avx2-linux-cuda-cuda_capability_host, which says sniff the cuda target but not the cpu. I don't think we can currently do that.

    opened by abadams 7
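The deferred-resolution idea in the target-string issue above can be sketched in plain C++ without Halide. Everything here is an assumption made for illustration: the function names are hypothetical, the `cuda_capability_host` placeholder is taken from the issue text rather than from Halide's actual feature list, and `probe` merely stands in for whatever would query the CUDA driver. The point is only that the expensive probe runs when the placeholder is resolved, not when the string is parsed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Placeholder feature name from the issue text (an assumption, not a real
// Halide feature).
static const std::string kHostCapability = "cuda_capability_host";

// Split a target string such as "x86-64-avx2-linux-cuda" on '-'.
std::vector<std::string> split_target(const std::string &target) {
    std::vector<std::string> tokens;
    std::stringstream stream(target);
    std::string token;
    while (std::getline(stream, token, '-')) {
        if (!token.empty()) tokens.push_back(token);
    }
    return tokens;
}

// Join tokens back into a '-'-separated target string.
std::string join_target(const std::vector<std::string> &tokens) {
    std::string joined;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i > 0) joined += '-';
        joined += tokens[i];
    }
    return joined;
}

// Replace the placeholder with a concrete capability. `probe` is called only
// if the placeholder is present, so targets without it never touch the GPU.
std::string resolve_cuda_capability(const std::string &target,
                                    const std::function<int()> &probe) {
    std::vector<std::string> tokens = split_target(target);
    for (std::string &token : tokens) {
        if (token == kHostCapability) {
            assert(probe && "probe must be callable");
            token = "cuda_capability_" + std::to_string(probe());
        }
    }
    return join_target(tokens);
}
```

With a probe that returns 31, resolving x86-64-avx2-linux-cuda-cuda_capability_host yields x86-64-avx2-linux-cuda-cuda_capability_31, while a target string without the placeholder is returned unchanged and the probe is never invoked.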
  • Make/CMake should check the HVX SDK version in use

    Our README documents that we require at least HVX SDK 4.3.0, but nothing in our system appears to check this, and I can locally build everything with 3.5.x (and probably earlier). We should add some logic to check the version if this is indeed a requirement.

    opened by steven-johnson 0
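The check requested in the HVX SDK issue above would live in the Make/CMake files, but the comparison it needs is simple enough to sketch in C++. The helper names are hypothetical, and treating a non-numeric component such as the "x" in "3.5.x" as 0 is an assumption of this sketch, not documented behavior of any SDK tool. Only the 4.3.0 minimum comes from the issue text.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>

// Parse "major.minor.patch" into three integers. Missing or non-numeric
// components (for example the "x" in "3.5.x") are treated as 0.
std::array<int, 3> parse_version(const std::string &text) {
    std::array<int, 3> parts = {0, 0, 0};
    std::stringstream stream(text);
    std::string field;
    for (std::size_t i = 0; i < parts.size() && std::getline(stream, field, '.'); ++i) {
        try {
            parts[i] = std::stoi(field);
        } catch (const std::exception &) {
            parts[i] = 0;
        }
    }
    return parts;
}

// True if `found` is the same as or newer than `required`. std::array
// compares lexicographically, which matches version ordering here.
bool version_at_least(const std::string &found, const std::string &required) {
    return parse_version(found) >= parse_version(required);
}
```

Comparing components numerically rather than as strings matters: a hypothetical 4.10.1 must rank above 4.3.0, which a plain string comparison would get wrong.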
C++ library for geographical raster data analysis

Pronto Raster library The Pronto Raster Library is a C++ library to work with raster data. The core idea of the library is to make raster data accessi

Alex Hagen-Zanker 39 Oct 6, 2021
Earth observation data cubes from GDAL image collections

gdalcubes - Earth observation data cubes from GDAL image collections gdalcubes is a library to represent collections of Earth Observation (EO) images

Marius Appel 67 Nov 23, 2021
A thin, highly portable C++ intermediate representation for dense loop-based computation.

Facebook Research 46 Nov 15, 2021
A language and editor for scientific computation

Forscape A language and editor for scientific computation Focus on the problem, not the implementation details Forscape solves engineering problems wi

John Till 2 Nov 28, 2021
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

CatBoost is a machine learning method based on gradient boosting over decision tree

CatBoost 6.2k Dec 3, 2021
Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as task graphs that are scheduled concurrently and asynchronously on both CPUs and GPUs.

Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as tasks in a graph structure, where edges represent task dependencies

null 19 Nov 16, 2021
Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

Tuplex 739 Dec 1, 2021
An easy-to-use multithreading thread pool library for C. It is a handy stream-like job scheduler with an automatic garbage collector. This is a multithreaded job scheduler for non-I/O-bound computation.

An easy-to-use multithreading thread pool library for C. It is a handy stream-like job scheduler with an automatic garbage collector for non-I/O-bound computation.

Hyoung Min Suh 11 Mar 6, 2021
SecMML: Secure MPC(multi-party computation) Machine Learning Framework

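SecMML Introduction SecMML is a branch of FudanMPL (Multi-Party Computation + Machine Learning). It is an efficient, scalable secure multi-party computation (MPC) framework for training machine learning models, built on the BGW protocol. The framework can be applied to scenarios in which three or more parties train jointly. Currently, SecMM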

null 56 Nov 23, 2021
A framework for generic hybrid two-party computation and private inference with neural networks

MOTION2NX -- A Framework for Generic Hybrid Two-Party Computation and Private Inference with Neural Networks This software is an extension of the MOTI

ENCRYPTO 3 Nov 26, 2021
Fast parallel CTC.

warp-ctc A fast parallel implementation of CTC, on both CPU and GPU. Introduction Connectionist Temporal Classification is a loss funct

Baidu Research 4k Nov 28, 2021
An optimized C library for math, parallel processing and data movement

PAL: The Parallel Architectures Library The Parallel Architectures Library (PAL) is a compact C library with optimized routines for math, synchronizat

Parallella 289 Dec 6, 2021
A fast, portable, simple, and free C/C++ IDE

A fast, portable, simple, and free C/C++ IDE. Dev C++ has been downloaded over 67,796,885 times since 2000.

Embarcadero Technologies 1.4k Dec 2, 2021
Skylark Edit is a customizable text/hex editor. Small, Portable, Fast.

Skylark Edit is written in C, a high performance text/hex editor. Embedded Database-client/Redis-client/Lua-engine. You can run Lua scripts and SQL files directly.

hua andy 8 Nov 14, 2021
The Orchid-Fst project implements a fast text-string dictionary search data structure, the finite state transducer (FST), in C++. This open-source C++ FST project has many significant advantages.

Orchid-Fst 1. Project Overview This project Orchid-Fst implements a fast text string dictionary search data structure: Finite state transducer , which

Bin Ding 7 Nov 16, 2021
Microsoft 2.4k Nov 28, 2021
Warp speed Data Transfer (WDT) is an embeddable library (and command line tool) aiming to transfer data between 2 systems as fast as possible over multiple TCP paths.

WDT Warp speed Data Transfer Design philosophy/Overview Goal: Lowest possible total transfer time - to be only hardware limited (disc or network bandw

Facebook 2.6k Dec 7, 2021
A General-purpose Parallel and Heterogeneous Task Programming System

Taskflow Taskflow helps you quickly write parallel and heterogeneous tasks programs in modern C++ Why Taskflow? Taskflow is faster, more expressive, a

Taskflow 6.1k Dec 2, 2021
Kokkos C++ Performance Portability Programming EcoSystem: The Programming Model - Parallel Execution and Memory Abstraction

Kokkos: Core Libraries Kokkos Core implements a programming model in C++ for writing performance portable applications targeting all major HPC platfor

Kokkos 896 Dec 4, 2021
Powerful multi-threaded coroutine dispatcher and parallel execution engine

Quantum Library : A scalable C++ coroutine framework Quantum is a full-featured and powerful C++ framework built on top of the Boost coroutine library

Bloomberg 370 Nov 30, 2021
Parallel, indexed xz compressor

pixz Pixz (pronounced pixie) is a parallel, indexing version of xz. Repository: https://github.com/vasi/pixz Downloads: https://github.com/vasi/pixz/r

Dave Vasilevsky 562 Dec 4, 2021
C++ Parallel Computing and Asynchronous Networking Engine

Sogou C++ Workflow As Sogou's C++ server engine, Sogou C++ Workflow supports almost all back-end C++ online services of Sogou, including all sea

Sogou-inc 6.4k Nov 30, 2021
ParaMonte: Plain Powerful Parallel Monte Carlo and MCMC Library for Python, MATLAB, Fortran, C++, C.

ParaMonte: Plain Powerful Parallel Monte Carlo L

Computational Data Science Lab 141 Nov 24, 2021
Material for the UIBK Parallel Programming Lab (2021)

UIBK PS Parallel Systems (703078, 2021) This repository contains material required to complete exercises for the Parallel Programming lab in the 2021

null 14 Nov 17, 2021
monolish: MONOlithic Liner equation Solvers for Highly-parallel architecture

monolish: MONOlithic LIner equation Solvers for Highly-parallel architecture monolish is a linear equation solver library that monolithically fuses va

RICOS Co. Ltd. 136 Dec 4, 2021