Halide: a language for fast, portable data-parallel computation



Halide is a programming language designed to make it easier to write high-performance image and array processing code on modern machines. Halide currently targets:

  • CPU architectures: x86, ARM, MIPS, Hexagon, PowerPC
  • Operating systems: Linux, Windows, Mac OS X, Android, iOS, Qualcomm QuRT
  • GPU Compute APIs: CUDA, OpenCL, OpenGL Compute Shaders, Apple Metal, Microsoft Direct X 12

Rather than being a standalone programming language, Halide is embedded in C++: you write C++ code that builds an in-memory representation of a Halide pipeline using Halide's C++ API. You can then compile this representation to an object file, or JIT-compile it and run it in the same process. Halide also ships a Python binding that offers full support for writing Halide embedded in Python, without C++.

For more detail about what Halide is, see http://halide-lang.org.

For API documentation see http://halide-lang.org/docs

To see some example code, look in the tutorials directory.

If you've acquired a full source distribution and want to build Halide, see the notes below.

Getting Halide

Binary tarballs

The latest version of Halide is Halide 11.0.1. We provide binary releases for many popular platforms and architectures, including 32/64-bit x86 Windows, 64-bit macOS, and 32/64-bit x86/ARM Ubuntu Linux. See the releases page on GitHub.


If you use vcpkg to manage dependencies, you can install Halide via:

$ vcpkg install halide:x64-windows # or x64-linux/x64-osx

Note two caveats: first, at the time of writing, MSVC mis-compiles LLVM on x86-windows, so Halide cannot be used via vcpkg on that platform; second, vcpkg installs only the minimum Halide backends required to compile code for the active platform. If you want all the backends, install halide[target-all]:x64-windows instead. Note that since this will build LLVM, it will take a lot of disk space (up to 100 GB).


Alternatively, if you use macOS, you can install Halide via Homebrew like so:

$ brew install halide

Other package managers

We are interested in bringing Halide to other popular package managers and Linux distribution repositories including, but not limited to, Conan, Debian, Ubuntu (or PPA), CentOS/Fedora, and Arch. If you have experience publishing packages, we would be happy to work with you!

If you are a maintainer of any other package distribution platform, we would be excited to work with you, too.

Building Halide with Make


Have LLVM 10.0 (or greater) installed and run make in the root directory of the repository (where this README is).

Acquiring LLVM

At any point in time, building Halide requires either the latest stable version of LLVM, the previous stable version of LLVM, or trunk. At the time of writing, this means versions 11.0 and 10.0 are supported, but 9.0 is not. The commands llvm-config and clang must be somewhere in the path.

If your OS does not have packages for llvm, you can find binaries for it at http://llvm.org/releases/download.html. Download an appropriate package and then either install it, or at least put the bin subdirectory in your path. (This works well on OS X and Ubuntu.)

If you want to build it yourself, first check it out from GitHub:

% git clone --depth 1 --branch llvmorg-11.0.0 https://github.com/llvm/llvm-project.git

(If you want to build LLVM 10.x, use branch llvmorg-10.0.1; for current trunk, use main)

Then build it like so:

% cmake -DCMAKE_BUILD_TYPE=Release \
        -DLLVM_ENABLE_PROJECTS="clang;lld;clang-tools-extra" \
        -DLLVM_TARGETS_TO_BUILD="X86;ARM;NVPTX;AArch64;Mips;Hexagon" \
        -S llvm-project/llvm -B llvm-build
% cmake --build llvm-build
% cmake --install llvm-build --prefix llvm-install

Then, to point Halide to it:

% export LLVM_ROOT=$PWD/llvm-install
% export LLVM_CONFIG=$LLVM_ROOT/bin/llvm-config

Note that you must add clang to LLVM_ENABLE_PROJECTS; adding lld to LLVM_ENABLE_PROJECTS is only required when using WebAssembly, and adding clang-tools-extra is only necessary if you plan to contribute code to Halide (so that you can run clang-tidy on your pull requests). We recommend enabling both in all cases, to simplify builds. You can disable exception handling (EH) and RTTI if you don't want the Python bindings.

Building Halide with make

With LLVM_CONFIG set (or llvm-config in your path), you should be able to just run make in the root directory of the Halide source tree. make run_tests will run the JIT test suite, and make test_apps will make sure all the apps compile and run (but won't check their output).

There is no make install yet. If you want to make an install package, run make distrib.

Building Halide out-of-tree with make

If you wish to build Halide in a separate directory, you can do that like so:

% cd ..
% mkdir halide_build
% cd halide_build
% make -f ../Halide/Makefile

Building Halide with CMake

macOS and Linux

Follow the above instructions to build LLVM or acquire a suitable binary release. Then change directory to the Halide repository and run:

% cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_DIR=$LLVM_ROOT/lib/cmake/llvm -S . -B build
% cmake --build build

LLVM_DIR is the folder in the LLVM installation tree (do not use the build tree by mistake) that contains LLVMConfig.cmake. It is not required to set this variable if you have a suitable system-wide version installed. If you have multiple system-wide versions installed, you can specify the version with Halide_REQUIRE_LLVM_VERSION. Add -G Ninja if you prefer to build with the Ninja generator.


Windows

We suggest building with Visual Studio 2019. Your mileage may vary with earlier versions. Be sure to install the "C++ CMake tools for Windows" in the Visual Studio installer. For older versions of Visual Studio, do not install the CMake tools; instead, acquire CMake and Ninja from their respective project websites.

These instructions start from the D: drive. We assume this git repo is cloned to D:\Halide. We also assume that your shell environment is set up correctly. For a 64-bit build, run:

D:\> "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x64

For a 32-bit build, run:

D:\> "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x64_x86

Managing dependencies with vcpkg

The best way to get compatible dependencies on Windows is to use vcpkg. Install it like so:

D:\> git clone https://github.com/Microsoft/vcpkg.git
D:\> cd vcpkg
D:\> .\bootstrap-vcpkg.bat
D:\vcpkg> .\vcpkg integrate install
CMake projects should use: "-DCMAKE_TOOLCHAIN_FILE=D:/vcpkg/scripts/buildsystems/vcpkg.cmake"

Then install the libraries. For a 64-bit build, run:

D:\vcpkg> .\vcpkg install libpng:x64-windows libjpeg-turbo:x64-windows llvm[target-all,clang-tools-extra]:x64-windows

To support 32-bit builds, also run:

D:\vcpkg> .\vcpkg install libpng:x86-windows libjpeg-turbo:x86-windows llvm[target-all,clang-tools-extra]:x86-windows

Building Halide

Create a separate build tree and call CMake with vcpkg's toolchain. This will build in either 32-bit or 64-bit depending on the environment script (vcvars) that was run earlier.

D:\Halide> cmake -G Ninja ^
                 -DCMAKE_BUILD_TYPE=Release ^
                 -DCMAKE_TOOLCHAIN_FILE=D:/vcpkg/scripts/buildsystems/vcpkg.cmake ^
                 -S . -B build

Note: If building with Python bindings on 32-bit (enabled by default), be sure to point CMake to the installation path of a 32-bit Python 3. You can do this by specifying, for example: "-DPython3_ROOT_DIR=C:\Program Files (x86)\Python38-32".

Then run the build with:

D:\Halide> cmake --build build --config Release -j %NUMBER_OF_PROCESSORS%

To run all the tests:

D:\Halide> cd build
D:\Halide\build> ctest -C Release

Subsets of the tests can be selected with -L and include correctness, python, error, and the other directory names under tests/.

Building LLVM (optional)

Follow these steps if you want to build LLVM yourself. First, download LLVM's sources (these instructions use the latest 11.0 release)

D:\> git clone --depth 1 --branch llvmorg-11.0.0 https://github.com/llvm/llvm-project.git

For a 64-bit build, run:

D:\> cmake -G Ninja ^
           -DCMAKE_BUILD_TYPE=Release ^
           -DLLVM_ENABLE_PROJECTS=clang;lld;clang-tools-extra ^
           -DLLVM_TARGETS_TO_BUILD=X86;ARM;NVPTX;AArch64;Mips;Hexagon ^
           -DLLVM_ENABLE_EH=ON ^
           -DLLVM_ENABLE_RTTI=ON ^
           -DLLVM_BUILD_32_BITS=OFF ^
           -S llvm-project\llvm -B llvm-build

For a 32-bit build, run:

D:\> cmake -G Ninja ^
           -DCMAKE_BUILD_TYPE=Release ^
           -DLLVM_ENABLE_PROJECTS=clang;lld;clang-tools-extra ^
           -DLLVM_TARGETS_TO_BUILD=X86;ARM;NVPTX;AArch64;Mips;Hexagon ^
           -DLLVM_ENABLE_EH=ON ^
           -DLLVM_ENABLE_RTTI=ON ^
           -DLLVM_BUILD_32_BITS=ON ^
           -S llvm-project\llvm -B llvm32-build

Finally, run:

D:\> cmake --build llvm-build --config Release -j %NUMBER_OF_PROCESSORS%
D:\> cmake --install llvm-build --prefix llvm-install

You can substitute Debug for Release in the above cmake commands if you want a debug build. Make sure to add -DLLVM_DIR=D:/llvm-install/lib/cmake/llvm to the Halide CMake command to override vcpkg's LLVM.

MSBuild: If you want to build LLVM with MSBuild instead of Ninja, use -G "Visual Studio 16 2019" -Thost=x64 -A x64 or -G "Visual Studio 16 2019" -Thost=x64 -A Win32 in place of -G Ninja.

If all else fails...

Do what the build-bots do: https://buildbot.halide-lang.org/master/#/builders

If the column that best matches your system is red, then maybe things aren't just broken for you. If it's green, then you can click the "stdio" links in the latest build to see what commands the build bots run, and what the output was.

Some useful environment variables

HL_TARGET=... will set Halide's AOT compilation target.

HL_JIT_TARGET=... will set Halide's JIT compilation target.

HL_DEBUG_CODEGEN=1 will print out pseudocode for what Halide is compiling. Higher numbers will print more detail.

HL_NUM_THREADS=... specifies the number of threads to create for the thread pool. When the async scheduling directive is used, more threads than this number may be required and thus allocated. A maximum of 256 threads is allowed. (By default, the number of cores on the host is used.)

HL_TRACE_FILE=... specifies a binary target file to dump tracing data into (ignored unless at least one trace_ feature is enabled in HL_TARGET or HL_JIT_TARGET). The output can be parsed programmatically by starting from the code in utils/HalideTraceViz.cpp.
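As an illustration, these variables compose like ordinary environment variables; the generator binary named below is a hypothetical placeholder, not part of Halide:

```shell
# Sketch only: ./my_generator stands in for your own AOT generator binary.
export HL_TARGET=arm-64-android-hvx   # AOT compilation target
export HL_DEBUG_CODEGEN=1             # print lowering pseudocode while compiling
export HL_NUM_THREADS=4               # thread-pool size (capped at 256)
echo "$HL_TARGET"                     # prints: arm-64-android-hvx
# ./my_generator
```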

Using Halide on OSX

Precompiled Halide distributions are built using Xcode's command-line tools with Apple clang 500.2.76. This means that we link against libc++ instead of libstdc++. You may need to adjust compiler options accordingly if you're using an older Xcode which does not default to libc++.

Halide OpenGL/GLSL backend

TODO(https://github.com/halide/Halide/issues/5633): update this for OpenGLCompute, which is staying

Halide for Hexagon HVX

Halide supports offloading work to the Qualcomm Hexagon DSP on Qualcomm Snapdragon 835 devices or newer. The Hexagon DSP provides a set of 128-byte vector instruction extensions, the Hexagon Vector eXtensions (HVX). HVX is well suited for image processing, and Halide for Hexagon HVX will generate the appropriate HVX vector instructions from a program authored in Halide.

Halide can be used to compile Hexagon object files directly, by using a target such as hexagon-32-qurt-hvx.

Halide can also be used to offload parts of a pipeline to Hexagon using the hexagon scheduling directive. To enable the hexagon scheduling directive, include the hvx target feature in your target. The currently supported combination of targets is to use the HVX target features with an x86 linux host (to use the simulator) or with an ARM android target (to use Hexagon DSP hardware). For examples of using the hexagon scheduling directive on both the simulator and a Hexagon DSP, see the blur example app.

To build and run an example app using the Hexagon target,

  1. Obtain and build trunk LLVM and Clang. (Earlier versions of LLVM may work but are not actively tested and thus not recommended.)
  2. Download and install the Hexagon SDK and Hexagon Tools. Hexagon SDK 3.4.1 or later is needed. Hexagon Tools 8.2 or later is needed.
  3. Build and run an example for Hexagon HVX

1. Obtain and build trunk LLVM and Clang

(Instructions are given above; just be sure to check out the main branch.)

2. Download and install the Hexagon SDK and Hexagon Tools

Go to https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools

  1. Select the Hexagon Series 600 Software and download the 3.4.1 version or later for Linux.
  2. Untar the installer.
  3. Run the extracted installer to install the Hexagon SDK and Hexagon Tools, installing the Hexagon SDK into /location/of/SDK/Hexagon_SDK/3.x and the Hexagon Tools into /location/of/SDK/Hexagon_Tools/8.x.
  4. Set an environment variable to point to the SDK installation location
    export SDK_LOC=/location/of/SDK

3. Build and run an example for Hexagon HVX

In addition to running Hexagon code on device, Halide also supports running Hexagon code on the simulator from the Hexagon tools.

To build and run the blur example in Halide/apps/blur on the simulator:

cd apps/blur
export HL_HEXAGON_SIM_REMOTE=../../src/runtime/hexagon_remote/bin/v62/hexagon_sim_remote
export HL_HEXAGON_TOOLS=$SDK_LOC/Hexagon_Tools/8.x/Tools/
LD_LIBRARY_PATH=../../src/runtime/hexagon_remote/bin/host/:$HL_HEXAGON_TOOLS/lib/iss/:. HL_TARGET=host-hvx make test

To build and run the blur example in Halide/apps/blur on Android:

To build the example for Android, first ensure that you have Android NDK r19b or later installed, and the ANDROID_NDK_ROOT environment variable points to it. (Note that Qualcomm Hexagon SDK v3.5.2 includes Android NDK r19c, which is fine.)

Now build and run the blur example using the script to run it on device:

HL_TARGET=arm-64-android-hvx ./adb_run_on_device.sh
  • Metaprogrammed simplifier rules


    I'm finding it very hard to write lots of new simplifier rules correctly. I want to make adding new rules more scalable in terms of programmer effort, simplifier stack usage, and simplifier runtime (in that order).

    This PR is a proof-of-concept of template meta-programming the simplifier rules, which improves all three factors. See the changes to Simplify::visit(Select) at the very bottom.


    • Much easier to read and add new rules
    • Uses way less stack space (216 bytes instead of 1126 bytes for the Select visitor stack frame)
    • Slightly faster, mostly due to the bespoke IRMatcher::equal
    • There's an opportunity to statically check properties of each rewrite rule (e.g. some measure of complexity decreases, implying there can be no loops).


    • If you do something wrong, the error messages are the usual template metaprogramming hell
    • The implementation is super ugly. Every time you see me doing something weird in IRMatch.h it's probably avoiding null checks, avoiding atomic increments or decrements on Exprs, or it's saving stack space.
    • Will be awkward to express some complex constraints in these rules. I'll probably have to extend the template classes a bit to cover things like: this and that must both be constants and one has to be larger than the other. I have some ideas for how to do this.
    opened by abadams 144
  • Error running HelloHexagon


    Following the instructions in the README, I installed the Hexagon SDK and LLVM 4.0, and recompiled the Halide sources using LLVM 4.0 too. Using the Hexagon SDK installer I got from Qualcomm, there is only a "HEXAGON_Tools/7.2.12" directory, which is different from the README's description of "Hexagon_Tools/8.0". I tried hard searching around and was not able to find Hexagon Tools 8.0 anywhere.

    Now if I adjust HL_HEXAGON_TOOLS accordingly to make things compile, I run into this error:

    HL_TARGET=host ./pipeline pipeline_cpu-host pipeline_cpu
    Target: x86-64-linux-avx-avx2-f16c-fma-sse41
    HL_TARGET=host-hvx_64 ./pipeline pipeline_hvx64-host pipeline_hvx64
    Target: x86-64-linux-avx-avx2-f16c-fma-hvx_64-sse41
    warning: unknown warning option '-Wno-override-module' [-Wunknown-warning-option]
    /tmp/hexrGI6K9.ll:2:1: error: expected top-level entity
    source_filename = "/usr/local/google/home/wilwong/halide/Halide-20160812/src/runtime/noos.cpp"
    1 warning and 1 error generated.
    Internal error at /usr/local/google/home/wilwong/halide/Halide-20160812/src/HexagonOffload.cpp:342 triggered by user code at ./pipeline.cpp:91:
    Condition failed: result == 0
    hexagon-clang failed
    make: *** [pipeline_hvx64-host.o] Aborted (core dumped)
    rm pipeline_cpu-host.o

    Searching deeper, apparently HexagonOffload.cpp creates a temporary file and passes it to hexagon-clang, but the temporary file does not look like anything I am familiar with (not C/C++).

    Does anybody know what is going wrong? Does it have to do with the wrong version of the Hexagon Tools?


    opened by wiltswong 51
  • Add Generator class and support code


    Generator is intended to be the preferred way to encapsulate Func building in user pipelines. This pull request probably could use more documentation, but the overall design and set of tests is solid enough to begin serious discussion.

    opened by steven-johnson 48
  • CMake build system fixes


    This PR contains several fixes to the CMake build system. Note that building the apps is disabled because they are still broken.

    The main highlights are:

    • A more sensible way of detecting and using LLVM (using its CMake Config file)
    • Support for building Doxygen.
    • Tutorials now build.

    There are still lots of things that need to be done to bring the CMake build system to parity with the Makefile build system but hopefully we'll be able to remove it eventually because I don't think maintaining two build systems is a good idea.

    opened by delcypher 47
  • New Boost.Python interface


    This branch is now feature complete, see bottom of the thread for updates/discussions.

    Following the spirit of other repositories this is a pull request of an ongoing branch. The code is not ready for merge, but this pull request enables to have an ongoing discussion to guide the development.

    The rationale for this new branch is explained in the readme.text: basically, the current Python bindings rely on a broken tool (SWIG), have too much spaghetti (see __init__.py), and are not linked closely enough to the C++ codebase.

    The current status is: a) the proof of concept of the approach is validated (see d); b) ~~70%~~ ~~80%~~ ~~90%~~ all of the groundwork is in place (~~the most notable missing pieces are the gpu API, Tuple, RDom~~); c) the code compiles and runs; d) blur.py runs and blurs the image as desired (erode.py and bilateral_grid.py also work). ~~e) No real unit tests in place (see Q4).~~

    In principle the code should work fine on Python 2 and 3, but I have only tested on Python 3.

    Some of the feedback I would like to have: Q1) Likelihood of a merge with master once API/test coverage is good enough.

    Q2) Suggested strategy to better address the drift between the Python bindings and the C++ code base (i.e. how to include them in continuous integration).

    Q3) The current code adds a dependency on Boost.Numpy (which is not part of Boost) for convenient I/O. I am considering including a copy in the repository; opinions on the cleanest way of handling the dependency?

    Q4) Suggestions for the best testing approach. For now I will focus on porting the demonstration apps from the old python bindings, and covering the areas I will be using; but I guess we could do better.

    Q5) Anyone interested in giving a hand? A second brain helps make code cleaner. Also, I would not mind delegating the gpu API part (which I am not planning to use in the short term).

    opened by rodrigob 46
  • ABI issue with LLVM11 and D3D12 (__stdcall)


    The following code snippet from d3d12compute.cpp:

    hCPU.ptr += i * descriptorSize;
    (*device)->CreateUnorderedAccessView(NULL, NULL, &NullDescUAV, hCPU);

    produces the following assembly

    LLVM 10                                               LLVM 11
       0x2c619a06bf1:       mov    %r14,-0x1d0(%rbp)         0x19f89f06c01:       mov    %r14,-0x1e0(%rbp)
       ...                                                   ...
       0x2c619a06c09:       mov    %rcx,-0x220(%rbp)         0x19f89f06c19:       mov    %rcx,-0x1d0(%rbp)    <------
                                                             0x19f89f06c20:       lea    -0x1d0(%rbp),%rcx
       0x2c619a06c10:       mov    %rcx,0x20(%rsp)           0x19f89f06c27:       mov    %rcx,0x20(%rsp)
       ...                                                   ...
                            callq  *%rax                                          callq  *%rax

    On the left, the value of hCPU.ptr is in %rcx, which is then placed on the stack (0x20(%rsp)) to serve as an argument to the subsequent callq to CreateUnorderedAccessView. This is the correct ABI behavior.

    On the right, LLVM 11 decided to replace the value of %rcx with the address of that value (notice the lea instruction). This basically turns something that should have been passed by value into something that is passed by reference.

    In short, LLVM 11 is insisting on passing the struct by reference, even though it fits entirely in a 64-bit word and should have been passed by value:

    typedef struct D3D12_CPU_DESCRIPTOR_HANDLE {
        SIZE_T ptr;
    } D3D12_CPU_DESCRIPTOR_HANDLE;
    opened by slomp 45
  • Update CMake to use modern features in testing.


    This is something of a big PR, but it only touches the CMake build, with one notable exception: apps/support/cmdline.h has been modified to (crudely) support non-RTTI builds.

    The main contribution is this: pseudo-targets for running tests have been removed in favor of using CTest, which has a number of advantages. Notably, it has a native notion of test labels, which allow us to define and select groups of tests to run. It also allows developers to run only those tests which failed previously; this is especially useful after an incremental build.

    The CMake minimum version requirement has been bumped from 3.3 to 3.14. A lot has changed in CMake in recent years and an initial effort has been made to modernize. For instance, the halide_use_image_io function has been replaced with an alias target Halide::ImageIO to which a normal target can link. A subsequent PR will focus on modernizing the rest of the CMake build so we can be more easily integrated into other projects, packaged by popular package managers, and installed into standard system locations.

    A large number of new lines list source files. Globbing source files in CMake breaks incremental builds and leads to frustrating scenarios where the lack of changes to CMakeLists after a pull requires a developer to do a full rebuild. The CMake developers strongly caution against source file globbing. It is the prevailing opinion on StackOverflow and in talks from the maintainers.

    Expect build failures while I learn how Travis works. Will also need to ask about the buildbots.

    opened by alexreinking 39
  • Atomics support



    This is an attempt to add basic atomics support to Halide. Following the suggestions in https://github.com/halide/Halide/issues/1017, I added an atomic scheduling directive, such that you can write a schedule like this:

    hist(x) = 0;
    hist(im(r)) += 1;

    (also see test/correctness/atomics.cpp) I implement this by adding an atomic flag to the Store node, and emit the corresponding backend code by checking the flag. The LLVM, OpenCL, and CUDA backends are supported. Vectorized store is not supported yet. I didn't implement the coarser-grained locking mechanism discussed on the mailing list, as it requires more complicated backends. All atomics are implemented through an atomicAdd or an atomicCompareAndSwap loop. I don't have machines at the moment to test the OpenGLCompute and DirectX12 backends, so I didn't implement atomics for them. The PTX backend is mostly copy-pasted from the gradient-halide repo.

    As I mentioned in the issue, atomics are important, if not essential, for gradient computation, especially on GPUs. It would be nice if you could take a look when you have time.

    opened by BachiLi 37
  • Modernize CMake


    This PR overhauls the CMake build to use modern features and improve our ability to be used/included by other projects. Note that a recent version of CMake (3.16+) is now required.

    New features:

    1. Clients now use find_package(Halide) to use Halide.
    2. Clients can alternately include Halide as a Git submodule and simply add_subdirectory our root. This also works with FetchContent, which automates cloning a git repo and downloading it to the build dir.
    3. Apps are built in the same way our clients would build them. Each one has an associated test that builds it and confirms that the app runs without crashing. See apps/lens_blur/CMakeLists.txt for an example.
    4. Python bindings are now built on all platforms.
    5. Tests build with precompiled headers.
    6. The new autoschedulers are packaged with the build and conveniently available from CMake.
    7. All of the tutorials are now built and tested (they weren't before!)

    Changes outside CMake:

    1. Documentation is handled by doc/CMakeLists.txt. There is no more Doxyfile.
    2. Tests no longer have include access to the whole source tree. They can see the runtime includes and the test/common includes only. The affected sources and the Makefile have been modified accordingly.
    3. OpenGL tests use system headers (as a consequence of the previous)
    4. Tests in the tests/ folder must now obey the following rules: a. Print "Success!" upon success unless it is an error or warning test. b. Print a message beginning with "[SKIP]" if the test will not run for some reason (e.g. no GPU).

    Breaking changes:

    1. halide.cmake and its associated functions are gone.
    2. Generator stubs are no longer supported from CMake.
    3. CMake 3.16+ required. This is the default version on Ubuntu 20.04 LTS and Visual Studio 2019.

    Fixes #870 Fixes #2400 Fixes #2643 Fixes #2821 Fixes #2852 Fixes #2942 Fixes #3658 Fixes #4009 Fixes #4284 Fixes #4476 Fixes #4581 Fixes #4890 Fixes #4893 Fixes #4895 Fixes #4943 Fixes #4948

    opened by alexreinking 35
  • AOT-generated code should include Argument information


    It would be useful if AOT-generated Halide filters included a way to introspect the expected input and output arguments.

    Strawman proposal: Add simple array-of-struct data structures with names that match the filter. e.g.:

    struct HalideArgumentDescriptor {
      // Halide::Type isn't really available at runtime when running AOT;
      // we'll just sorta replicate it here.
      enum Type { kInt = 0, kUInt = 1, kFloat = 2, kHandle = 3 };
      const char* const name;
      const bool is_buffer;
      const Type type_code;
      const uint8_t type_bits;
      const uint8_t buffer_dimensions;
      const double scalar_default;
      const bool has_scalar_minmax;
      const double scalar_min, scalar_max;
    };

    struct HalideArguments {
      const int num_inputs;
      const HalideArgumentDescriptor* inputs;
      const int num_outputs;
      const HalideArgumentDescriptor* outputs;
    };

    // If the filter we generate is like so:
    // extern "C" int my_awesome_filter(buffer_t* in1, buffer_t* in2, int16_t i, float f, buffer_t* out);
    // We'd generate something like:
    extern "C" HalideArguments my_awesome_filter_halide_arguments = {
        /* inputs */
          { "in1", true, kUInt, 8, 3, 0.0, false, 0.0, 0.0 },
          { "in2", true, kUInt, 8, 3, 0.0, false, 0.0, 0.0 },
          { "i", false, kInt, 16, 0, 0.0, false, 0.0, 0.0 },
          { "f", false, kFloat, 32, 0, 0.0, false, 0.0, 0.0 },
        /* outputs */
          { "out", true, kUInt, 8, 3, 0.0, false, 0.0, 0.0 },
    };
    Since we'd just be adding a new extern name that no one is likely to ever be looking for, existing code should be unaffected. It's a POD of modest size so addition to code size should be unimportant.

    Specific layout and contents of the descriptor-struct open for discussion, of course. The one given above is similar to one I'm using in a private branch that suits my purposes.

    opened by steven-johnson 34
  • Prototype of multiple scattering update definitions


    This is a prototype of being able to scatter to multiple coordinates in an update definition. You bundle the args and the values in a special intrinsic, which gets unrolled during lowering on a one-to-one basis for all the bundles that appear on the LHS and RHS (they must all have the same size and can't be nested).

    The advantage of this feature is that it lets you write an update definition that requires loading multiple values, doing something to them, and then storing all of them in one step. A good example is in the test - rotating a square image in place can be done by repeatedly loading four values, rotating them, and then storing them again to the same locations. I don't think we could currently express that algorithm.

    I'm opening this for discussion as a minimum viable version of the feature.

    I don't think the name Call::tuple is the right thing, and there's no sugar for this yet. Tuples are already a thing, and a Tuple has mixed types, which can't work here. Any other ideas for the name or for sugar for the feature are very welcome. "pack" maybe? "bundle"?

    opened by abadams 33
  • Need documentation for all of the Generator output files


    There isn't any good user-facing documentation that describes all of the outputs that a Generator can produce; as pointed out in https://github.com/halide/Halide/pull/7170, there really should be one. (Perhaps in the Doxygen for Generator or Module?)

    opened by steven-johnson 0
  • [RISC-V] Intrinsic has incorrect return type! (@llvm.vp.fcmp.nxv2f32, @llvm.vp.select.nxv2i32)


    Hi, @zvookin! Trying RISC-V, I got the following error on a specific algorithm, a max-pooling deep-learning layer. It is reproduced only with .vectorize applied.

    LLVM: 15.0.2

    #include "Halide.h"
    using namespace Halide;
    int main(int argc, char** argv) {
        Func top;
        Var x("x"), y("y"), c("c"), n("n");
        Buffer<float> input(1, 96, 55, 55);
        Halide::RDom r(0, 2, 0, 2);
        Halide::Expr kx, ky;
        kx = min(x * 2 + r.x, 54);
        ky = min(y * 2 + r.y, 54);
        Halide::Tuple res = argmax(input(kx, ky, c, n));
        top(x, y, c, n) = res[2];
        top.bound(x, 0, 27)
           .bound(y, 0, 27)
           .bound(c, 0, 96)
           .bound(n, 0, 1);
        top.vectorize(x, 8);
        // Scheduling
        Target target = get_host_target();
        target.vector_bits = 8 * sizeof(float) * 8;
        std::vector<Target::Feature> features;
        std::cout << target << std::endl;
        try {
            top.compile_to_static_library("compiled", {}, "pooling", target);
        } catch (Halide::InternalError& ex) {
            std::cout << ex.what() << std::endl;
        }
        return 0;
    }
    produce f0:
      for c in [0, 95]:
        for y in [0, 26]:
          for x.x in [0, 3]:
            vectorized x.v2 in [0, 7]:
              produce argmax:
                argmax(...) = ...
                for r4 in [0, 1]:
                  for r4 in [0, 1]:
                    argmax(...) = ...
              consume argmax:
                f0(...) = ...
    Intrinsic has incorrect return type!
    ptr @llvm.vp.fcmp.nxv2f32
    Intrinsic has incorrect argument type!
    ptr @llvm.vp.select.nxv2i32
    Intrinsic has incorrect return type!
    ptr @llvm.vp.fcmp.nxv2f32
    Intrinsic has incorrect argument type!
    ptr @llvm.vp.select.nxv2i32
    Intrinsic has incorrect return type!
    ptr @llvm.vp.fcmp.nxv2f32
    Intrinsic has incorrect argument type!
    ptr @llvm.vp.select.nxv2f32
    Intrinsic has incorrect return type!
    ptr @llvm.vp.fcmp.nxv2f32
    Intrinsic has incorrect argument type!
    ptr @llvm.vp.select.nxv2i32
    Intrinsic has incorrect return type!
    ptr @llvm.vp.fcmp.nxv2f32
    Intrinsic has incorrect argument type!
    ptr @llvm.vp.select.nxv2i32
    Intrinsic has incorrect return type!
    ptr @llvm.vp.fcmp.nxv2f32
    Intrinsic has incorrect argument type!
    ptr @llvm.vp.select.nxv2f32
    Internal Error at /home/dkurt/Halide/src/CodeGen_LLVM.cpp:632 triggered by user code at : Condition failed: !verifyFunction(*function, &llvm::errs()):

    If I instead use just top(x, y, c, n) = maximum(input(kx, ky, c, n)), similar errors occur:

    Intrinsic has incorrect return type!
    ptr @llvm.vp.fcmp.nxv2f32
    Intrinsic has incorrect argument type!
    ptr @llvm.vp.select.nxv2f32
    Intrinsic has incorrect return type!
    ptr @llvm.vp.fcmp.nxv2f32
    Intrinsic has incorrect argument type!
    ptr @llvm.vp.select.nxv2f3
    opened by dkurt 2
  • Splitting a statement (vector) into N independent statements

    I wanted to ask whether there is any transformation (or part of one) that can be used to split a vectorized statement in Halide IR into N statements, where each of the N statements computes the result over a contiguous, non-overlapping subset of the output. For example, given the input statement:

    result[0..64] = A[0..64] + B[0..64]

    we want to split it into 4 statements:

    result[0..16] = A[0..16] + B[0..16]
    result[16..32] = A[16..32] + B[16..32]
    result[32..48] = A[32..48] + B[32..48]
    result[48..64] = A[48..64] + B[48..64]

    The expressions themselves can contain other sub-expressions, which would have to be split in the same way.

    opened by RafaeNoor 2
C++ library for geographical raster data analysis

Pronto Raster library The Pronto Raster Library is a C++ library to work with raster data. The core idea of the library is to make raster data accessi

Alex Hagen-Zanker 43 Oct 5, 2022
Earth observation data cubes from GDAL image collections

gdalcubes - Earth observation data cubes from GDAL image collections gdalcubes is a library to represent collections of Earth Observation (EO) images

Marius Appel 71 Sep 27, 2022
Parallel-util - Simple header-only implementation of "parallel for" and "parallel map" for C++11

parallel-util A single-header implementation of parallel_for, parallel_map, and parallel_exec using C++11. This library is based on multi-threading on

Yuki Koyama 27 Jun 24, 2022
A thin, highly portable C++ intermediate representation for dense loop-based computation.

Facebook Research 125 Nov 24, 2022
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

CatBoost 6.8k Nov 26, 2022
A language and editor for scientific computation

Forscape A language and editor for scientific computation Focus on the problem, not the implementation details Forscape solves engineering problems wi

John Till 24 Dec 4, 2022
Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as task graphs that are scheduled concurrently and asynchronously on both CPUs and GPUs.

Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as tasks in a graph structure, where edges represent task dependencies

null 27 Nov 14, 2022
A program developed using MPI for distributed computation of histograms for large data and performance analysis on multi-core systems

mpi-histo A program developed using MPI for distributed computation of histograms for large data and performance analysis on multi-core systems. T

Raj Shrestha 2 Dec 21, 2021
Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

Tuplex 791 Nov 15, 2022
LibreSSL Portable itself. This includes the build scaffold and compatibility layer that builds portable LibreSSL from the OpenBSD source code.

OpenBSD LibreSSL Portable 1.2k Dec 3, 2022
An easy-to-use multithreading thread pool library for C. It is a handy stream-like job scheduler with an automatic garbage collector. This is a multithreaded job scheduler for non-I/O-bound computation.

An easy-to-use multithreading thread pool library for C. It is a handy stream-like job scheduler with an automatic garbage collector for non-I/O-bound computation.

Hyoung Min Suh 12 Jun 4, 2022
SecMML: Secure MPC(multi-party computation) Machine Learning Framework

SecMML Introduction: SecMML is a branch of FudanMPL (Multi-Party Computation + Machine Learning), an efficient and scalable secure multi-party computation (MPC) framework for training machine learning models, implemented on top of the BGW protocol. The framework can be applied to scenarios where three or more parties jointly train a model. Currently, SecMM

null 84 Nov 25, 2022
A framework for generic hybrid two-party computation and private inference with neural networks

MOTION2NX -- A Framework for Generic Hybrid Two-Party Computation and Private Inference with Neural Networks This software is an extension of the MOTI

ENCRYPTO 15 Nov 29, 2022
Python Distlink - MOID computation

Python Distlink - MOID computation This is a Python wrapper for the C++ library distlink published by R.V. Baluev and D.V. Mikryukov used to compute t

Maximilian 1 Jan 24, 2022
Fast parallel CTC.

In Chinese 中文版 warp-ctc A fast parallel implementation of CTC, on both CPU and GPU. Introduction Connectionist Temporal Classification is a loss funct

Baidu Research 4k Nov 25, 2022
Parallel-hashmap - A family of header-only, very fast and memory-friendly hashmap and btree containers.

The Parallel Hashmap Overview This repository aims to provide a set of excellent hash map implementations, as well as a btree alternative to std::map

Gregory Popovitch 1.7k Dec 2, 2022
An optimized C library for math, parallel processing and data movement

PAL: The Parallel Architectures Library The Parallel Architectures Library (PAL) is a compact C library with optimized routines for math, synchronizat

Parallella 295 Nov 22, 2022
Peregrine - A blazing-fast language for the blazing-fast world (WIP)

A Blazing-Fast Language for the Blazing-Fast world. The Peregrine Programming Language Peregrine is a Compiled, Systems Programming Language, currentl

Peregrine 1.5k Nov 29, 2022
The Wren Programming Language. Wren is a small, fast, class-based concurrent scripting language.

Wren is a small, fast, class-based concurrent scripting language Think Smalltalk in a Lua-sized package with a dash of Erlang and wrapped up in a fami

Wren 6.1k Nov 29, 2022